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ABSTRACT 


The  use  of  lattice  Boltzmann  methods  (LBM)  for  fluid  flow  and  its  coupling  with  finite  el¬ 
ement  method  (FEM)  structural  models  for  fluid-structure  interaction  (ESI)  is  investigated. 
A  body  of  high  performance  EBM  software  that  exploits  graphic  processing  unit  (GPU)  and 
multiprocessor  programming  models  is  developed  and  validated  against  a  set  of  two-  and 
three-dimensional  benchmark  problems.  Computational  performance  is  shown  to  exceed 
recently  reported  results  for  single-workstation  implementations  over  a  range  of  problem 
sizes.  A  mixed-precision  EBM  collision  algorithm  is  presented  that  retains  the  accuracy  of 
double-precision  calculations  with  less  computational  cost  than  a  full  double-precision  im¬ 
plementation.  ESI  modelling  methodology  and  example  applications  are  presented  along 
with  a  novel  heterogeneous  parallel  implementation  that  exploits  task-level  parallelism  and 
workload  sharing  between  the  central  processing  unit  (CPU)  and  GPU  that  allows  signifi¬ 
cant  speedup  over  other  methods.  Multi-component  EBM  fluid  models  are  explicated  and 
simple  immiscible  multi-component  fluid  flows  in  two-dimensions  are  presented.  These 
multi-component  fluid  EBM  models  are  also  paired  with  structural  dynamics  solvers  for 
two-dimensional  ESI  simulations.  To  enhance  modeling  capability  for  domains  with  com¬ 
plex  surfaces,  a  novel  coupling  method  is  introduced  that  allows  use  of  both  classical  EBM 
(CEBM)  and  a  finite  element  EBM  (EEEBM)  to  be  combined  into  a  hybrid  EBM  that 
exploits  the  flexibility  of  EEEBM  while  retaining  the  efficiency  of  CEBM. 
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I.  INTRODUCTION 


The  present  thesis  investigates  the  use  of  the  lattice  Boltzmann  method  (LBM)  to 
solve  for  the  flow  of  viscous,  incompressible  fluids  while  accounting  for  the  effect  these 
fluid  flows  have  on  surrounding  elastic  structures.  From  waves  slapping  against  ship  struc¬ 
tural  members,  cooling  water  passing  over  heat-exchanger  tubes  to  blood  flowing  within 
veins  and  arteries  among  many  others,  FSI  has  applications  that  span  the  engineering  and 
life  sciences.  Towards  the  goal  of  simulating  such  physical  behavior,  several  intermediate 
steps  were  required.  These  intermediate  steps  start  with  the  development  of  a  robust  and 
highly  capable  software  tool  for  simulating  and  analyzing  flow  of  incompressible  viscous 
fluids  and  continue  through  to  integration  of  these  tools  with  structural  dynamics  solvers. 

A.  OBJECTIVES  AND  ORGANIZATION 

The  first  objective  is  to  develop  software  tools  required  for  a  LBM  flow  solver 
that  can  reliably  and  accurately  simulate  fluid  flow  problems  of  interest.  The  theory  and 
formulation  of  the  LBM,  including  detailed  considerations  of  stability,  accuracy  and  proper 
scaling  of  simulation  variables  to  allow  modeling  of  specific  fluid  systems  is  provided  in 
Chapter  11. 

With  a  detailed  understanding  of  the  theory,  software  tools  can  be  developed  to 
simulate  fluid  flows  of  interest.  Such  software  is  developed  and  subjected  to  a  collection 
of  validation  benchmarks  in  Chapter  III.  The  second-order  convergence  of  the  LBM  along 
with  select  boundary  conditions  is  demonstrated  along  with  a  demonstration  of  the  potential 
impacts  of  the  wide  use  of  single-precision  arithmetic  on  this  convergence  rate.  A  mixed- 
precision  LBM  implementation  is  introduced  that  provides  for  second-order  convergence 
for  select  boundary  conditions  while  retaining  some  of  the  performance  benefits  of  using 
single  precision  number  representation  and  arithmetic. 
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In  order  to  increase  the  accuracy  of  a  LBM  simulation,  the  spatial  and  temporal  dis¬ 
cretization  is  refined.  Smaller  time  steps  and  a  more  refined  lattice  both  lead  to  increased 
computational  demand.  In  order  to  execute  LBM  simulations  in  a  reasonable  amount  of 
time,  the  programs  are  written  to  be  executed  in  parallel.  In  Chapter  IV  the  Compute  Uni¬ 
fied  Device  Architecture  (CUDA)  programming  model  is  introduced  and  the  implementa¬ 
tion  of  the  LBM  programs  for  parallel  execution  on  a  graphical  processing  unit  (GPU)  is 
described.  The  achieved  performance  is  compared  with  recently  published  benchmarks. 
Two  hybrid  programming  models  are  also  demonstrated  that  use  a  combination  of  CUDA 
and  OpenMP  in  one  case  and  CUDA  and  message  passing  interface  (MPI)  in  another. 
Collectively,  Chapter  IV  describes  the  development  of  a  scalable  high-performance  LBM 
solver. 

Once  reliable  fluid  simulation  tools  are  in  place,  the  second  objective  is  to  couple 
this  fluid  solver  with  an  appropriate  structural  dynamics  model.  These  software  compo¬ 
nents  are  then  used  as  coupled  FSI  simulation  tools.  The  key  ingredients  of  computing 
forces  and  moments  along  the  fluid- structure  interface  and  accounting  for  their  exchange 
and  integrating  with  a  structural  dynamics  solver  for  a  coordinated  FSI  simulation  are  dis¬ 
cussed.  The  specific  methods  and  algorithms  for  doing  this  along  with  example  applications 
in  both  two  and  three  dimensions  are  presented  in  Chapter  V. 

The  aforementioned  software  tools  were  all  developed  based  on  the  classical  LBM 
theory.  A  recently  developed  modification,  termed  the  finite  element  lattice  Boltzmann 
method  (FELBM),  allows  use  of  unstructured  non-uniform  finite  element  grids  in  lieu  of 
the  regular  structured  grid  from  classical  lattice  Boltzmann  method  (CLBM).  The  FELBM 
gains  this  flexibility  at  the  expense  of  some  computational  efficiency  on  a  per-lattice-point 
basis.  An  algorithm  and  associated  software  tool  has  been  developed  resulting  in  a  hybrid 
lattice  Boltzmann  method  (HLBM)  whereby  both  CLBM  and  EELBM  are  used  on  disjoint 
sub-domains  of  an  overall  simulation.  This  hybrid  tool  leverages  the  simplicity  and  effi¬ 
ciency  of  the  CLBM  while  also  benefiting  from  the  geometric  flexibility  of  the  EELBM. 
The  overall  system  is  able  to  simulate  fluid  flow  over  the  combined  domain  with  less  com- 
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putational  effort  than  the  FELBM  while  reducing  the  memory  requirements  of  an  equiv¬ 
alent  simulation  using  only  CLBM.  The  theory  and  algorithms  for  accomplishing  this  is 
provided  in  Chapter  VI. 

The  last  objective  of  this  model  is  to  take  advantage  of  the  flexible  and  physically 
intuitive  methods  for  modeling  multi-component  fluid  systems  using  LBM.  A  discussion 
of  the  standard  LBM  theory  for  multi-component  fluids  as  well  as  example  problems  in 
fluid  flow  and  LSI  are  demonstrated  in  Chapter  VII. 

In  Chapter  VIII,  conclusions  and  prospects  for  future  research  are  discussed. 

B.  STATEMENT  OF  CONTRIBUTIONS 

The  principal  contributions  of  this  thesis  are: 

•  A  body  of  software  tools  that  provide  LBM  modeling  capability  for  single¬ 
component  incompressible  viscous  flows  in  two  and  three  dimensions. 

•  A  new  mixed-precision  LBM  implementation  that  retains  most  of  the  ac¬ 
curacy  of  double  precision  while  requiring  only  the  memory  of  single  pre¬ 
cision. 

•  A  GPU- accelerated  implementation  of  LBM  with  highly  competitive  per¬ 
formance  against  recently  published  benchmarks. 

•  A  new  method  to  simulate  two-way  LSI  that  exploits  task-level  concur¬ 
rency  for  a  heterogeneous-parallel  algorithm  using  both  GPU  for  the  fluid 
domain  and  central  processing  units  for  the  solid  domain. 

•  A  novel  method  for  combining  CLBM  and  LELBM  into  a  HLBM. 

•  LBM  flow  and  LSI  modeling  capability  for  multi-component  flows  in  two 
dimensions. 
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II.  LATTICE  BOLTZMANN  METHOD 


The  LBM  is  an  increasingly  popular  way  to  simulate  fluid  flow.  In  contrast  to 
more  conventional  methods  such  as  finite  difference  methods  (FDM)  control  volume  meth¬ 
ods  (CVM)  and  finite  element  methods  (FEM),  the  LBM  begins  not  with  the  picture  of 
the  fluid  as  a  continuous  medium,  but  instead  as  a  collection  of  particles.  These  parti¬ 
cles  move  and  undergo  local  interactions  with  other  particles  in  accordance  with  simple 
rules.  Macroscopic  physical  phenomena  such  as  those  conservation  laws  described  by  the 
Navier-Stokes  equations  emerge  from  the  large  number  of  these  local  interactions.  The 
microscopic  level  of  description  provides  an  intuitive  basis  for  generalization  to  complex 
systems  such  as  porous  media  ([l]-[3]),  two-phase  flow  ([4]-[6])  and  magnetohydrodynam¬ 
ics  ([7]-[9])  among  others.  By  judiciously  altering  the  formulation,  other  partial  differen¬ 
tial  equations  of  interest  have  been  modeled  by  a  similar  procedure  including  the  Burgers 
Equation  [10],  the  Korteweg-de  Vries  equation  [11],  the  Brinkman  equation  [12]  and  the 
Schrodinger  equation  [13].  Eor  a  concise  review  of  the  current  state  of  the  art  in  LBM,  an 
excellent  survey  can  be  found  in  [14]  with  a  recent  update  in  [15].  A  recently  published 
analysis  of  LBM  theory,  which  includes  a  thorough  critique  and  comparison  with  tradi¬ 
tional  computational  fluid  dynamics  techniques,  can  be  found  at  [16].  In  this  work,  the 
LBM  will  be  used  for  the  solution  of  the  Navier-Stokes  equations  for  single-component 
fluid  flows  as  well  as  a  limited  number  of  multi-component  fluid  flows. 

This  chapter  will  start  with  a  brief  overview  of  the  historical  development  of  the 
LBM.  This  will  be  followed  by  a  description  of  each  element  of  a  LBM  simulation  in¬ 
cluding  typical  lattice  structures  with  lattice  velocities  and  associated  weights,  collision 
operators,  boundary  conditions,  body  forces  and  scaling  requirements.  The  chapter  will 
be  concluded  with  a  detailed  example  application  of  the  LBM  to  two-dimensional  channel 
flow  over  a  cylindrical  obstacle. 
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A. 


LITERATURE  REVIEW  AND  INTRODUCTION 


Historically,  the  LBM  is  derived  from  the  concepts  of  the  cellular  automaton  [17], 
[18].  Space  is  described  by  a  regular  array  of  interconnecting  lattice  sites  and  time  is  di¬ 
vided  into  equally  spaced  time-steps.  The  cellular  automata  model  is  specified  by  stating 
the  rules  by  which  each  lattice  site  shall  be  updated  for  the  next  time  step.  Depending  on 
these  rules,  complex  physical  phenomena  emerge  [19].  A  classical  example  of  complex  be¬ 
havior  emerging  from  simple  rules  is  Conway’s  Game  of  Life.  In  some  cases,  the  CA  with 
associated  sets  of  update  rules  have  become  useful  as  a  model  for  real  physical  behavior 
and  have  become  a  means  to  gaining  more  fundamental  understanding.  Examples  include 
traffic  flow  [20],  population  dynamics  [21],  and  earthquake  prediction  [22]  to  name  but  a 
few. 

The  CA  model  underlying  a  fluid  dynamics  model  incorporates  movement  of  par¬ 
ticles  from  one  lattice  site  to  another  along  discrete  lattice  directions.  The  rule  for  lattice 
site  update  is  applied  to  all  particles  arriving  at  a  given  lattice  site  in  a  given  time  step  and 
is  represented  formally  in  Equation  1 , 

Na  (x  -f  5^ea,  t  +  5t)  =  Na  (x,  t)  +  ^la{N)  (1) 

where  Na  is  a  Boolean  variable  indicating  the  presence  or  absence  of  a  fluid  particle  trav¬ 
eling  along  lattice  direction  at  position  x.  The  rules  for  update — referred  to  as  the 
Collision  Operator — are  formally  denoted  by  One  advantage  of  this  formulation 

using  Boolean  variables  is  the  absence  of  round-off  errors;  all  arithmetic  is  exact.  Unfor¬ 
tunately,  though  the  mathematical  operations  are  simple  and  exact,  it  has  been  found  that 
they  are  required  in  enormous  numbers  to  overcome  statistical  noise  in  the  results.  Ad¬ 
ditionally,  it  has  been  found  that  further  lattice  symmetry  requirements  need  to  be  met  in 
order  to  provide  Galilean  invariance. 

The  EBM  emerged  from  the  solutions  presented  to  these  difficulties  [18],  [23].  The 
EBM  seeks  to  solve  the  discrete  Boltzmann  equation  which,  in  the  absence  of  external 
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forces  is: 


dfa 

dt 


+  Ga  ■  V/a 


,  tt  G  [0, . . .  1  q\  ,  Ofj  G 


(2) 


where  fa  is  the  partiele  veloeity  distribution  funetion  for  lattiee  direetion  a;  Oq,  is  the  set  of 
lattiee  veloeities;  and  is  the  eollision  operator.  Additionally,  initial  values  for  all  fa  must 
be  supplied  on  the  problem  domain  and  boundary  conditions  must  be  applied  appropriately. 
In  the  following  sections,  each  of  these  issues  will  be  addressed  in  turn  so  that  a  simulation 
of  fluid  flow  may  be  undertaken  using  the  LBM. 


B.  LATTICE  STRUCTURES 


In  the  LBM,  this  solution  is  sought  on  a  regular  lattiee.  A  lattiee  is  defined  by  a 
sound  speed  Cg,  a  set  of  d-dimensional  lattiee  veloeities  where  a  G  [0, . . . ,  g]  and  a  set 
of  weights  Wa-  The  usual  notation  to  specify  a  lattice  is  given  as  DdQq.  A  lattiee  eommonly 
used  in  two  dimensions  has  nine  velocities  and  is  denoted  D2Q9  and  is  illustrated  in  Figure 
1.  The  sound  speed,  weights  and  lattice  veloeities  for  this  model  are  given  in  Equation  3. 


Figure  1:  The  D2Q9  lattiee. 
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Commonly  used  lattices  for  three-dimensional  problems  are  shown  in  Figure  2.  Sets 
of  lattice  speeds  are  given  for  the  DSQlh  lattice  are  given  in  Equation  4,  and  those  for  the 
D3Q19  and  D3Q27  are  shown  respectively  in  Equations  5  and  6. 


D3Q15  Lattice 


D3Q19  Lotties  D3Q27  Lattice 


Eigure  2:  Commonly  used  lattice  topologies  for  EBM  in  three  dimensions. 
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The  values  of  c*,  Wa  and  the  vectors  are  all  selected  so  as  to  satisfy  a  set  of 
symmetry  conditions  given  in  Equation  7. 


=  1 

Ey/^  '^a^ai  0 

^ cx.^ai^ai^ak^al^cx.m  0 

(7) 


These  symmetry  conditions  play  an  important  roll  in  the  theory  of  LBM.  In  particular,  they 
are  needed  to  show  the  correspondence  between  the  LBM  and  the  incompressible  Navier- 
Stokes  equation.  It  can  be  shown  that  all  of  the  lattices  introduced  satisfy  these  conditions. 

The  discrete  Boltzmann  Equation  shown  in  Equation  2  is  in  the  general  form  of  an 
advection  equation.  The  momentum  space  is  discretized  along  the  q  lattice  speeds  which, 
with  the  advection  equation  analogy,  are  the  characteristic  speeds.  The  right  hand  side  of 
Equation  2  is  the  collision  operator  which  determines  what  happens  to  the  particle  pop¬ 
ulations  fa  as  they  traverse  the  lattice  in  their  respective  characteristic  directions.  Instead 
of  numerically  integrating  the  temporal  and  spatial  derivative  operators,  the  EBM  handles 
them  discretely  in  time  and  space  by  “streaming”  particle  distributions  from  a  source  lattice 
site  to  neighboring  lattice  site  in  each  direction.  This  process  is  illustrated  schematically 
with  the  arrows  in  in  Eigure  1  and  is  formally  expressed  in  Equation  8  where  r  is  the  posi¬ 
tion  vector  for  a  given  lattice  point  and  t  is  the  current  time  in  lattice  units. 

fa{r  +  ea,t  +  l)-fa{r,t)  =  na.  (8) 

The  simplest  and  most  popular  form  for  the  collision  operator  is  the  Bhatnagar- 
Gross-Krook  (BGK),  shown  in  Equation  9,  which  gives  a  single-parameter  relaxation  to 
equilibrium: 


(9) 


where  r  is  a  relaxation  parameter,  is  a  function  of  the  macroscopic  parameters  of  the 
fluid  represented  by  /„  given  by  Equation  10. 


fa  =  PWa 


1  + 


(Oo  ■  u)  ■  u) 

O  A 


1  (u  ■  u) 

2  c? 


(10) 


The  macroscopic  variables  of  fluid  density  and  velocity,  given  by  p  and  u,  respectively,  are 


10 


computed  as  moments  of  the  particle  distribution  function  fa  as  shown  in  Equations  1 1  and 
12.  When  required  for  physical  modeling,  the  fluid  pressure  p  can  also  be  obtained  from 
Equation  13. 

q-l 

q;=0 

1 

u  =  -  V]  faGa  (12) 

P  =  pel  (13) 

The  relaxation  parameter  r  can  be  related  to  the  fluid  kinematic  viscosity  v.  This  relation¬ 
ship  is  given  in  Equation  14  with  all  units  expressed  in  lattice  units. 

^  ^  ^  (W) 

Erequently  in  the  literature,  and  periodically  in  this  thesis,  the  inverse  of  r  is  used 
as  the  relaxation  parameter  and  is  conventionally  named  u.  Since  for  real  fluids,  v  must 
be  non-negative,  r  is  constrained  to  be  greater  than  or  equal  to  1.  In  notation  employing 
(jj,  this  implies  0  <  a;  <  2.  In  a  later  section  of  this  thesis  where  dimensional  scaling 
and  stability  are  discussed,  it  will  be  demonstrated  that  the  numerical  stability  of  any  given 
EBM  simulation  can  be  characterized  by  the  value  of  r  or  u.  Systems  where  the  combined 
fluid  properties,  boundary  conditions  and  EBM  spatial  and  temporal  discretizations  result  in 
the  value  of  uj  to  be  close  to  2,  or  conversely  r  approaching  tend  to  become  numerically 
unstable. 

C.  MULTIPLE  RELAXATION  TIME  COLLISION  OPERATOR 

While  the  single  relaxation  time  lattice  Bhatnagar-Gross-Krook  (EBGK)  operator 
is  easy  to  implement  and  computationally  efficient  within  the  context  of  a  single  EBM  time 
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step,  it  is  known  to  suffer  from  severe  stability  problems.  When  these  stability  problems 
can  be  overcome  while  still  using  LBGK,  it  is  often  only  obtained  at  the  expense  of  an 
increase  in  lattice  density  and  hence,  increase  in  computational  effort.  The  LBGK  has  other 
deficiencies  including  an  implied  fixed  Prandtl  number  of  one  and  a  fixed  ratio  between 
kinematic  and  bulk  viscosity.  In  order  to  provide  a  means  for  tuning  the  stability  properties 
of  a  given  simulation  while  also  a  mechanism  for  altering  more  specific  fluid  properties, 
alternative  collision  operators  have  been  developed. 

The  multiple  relaxation  time  (MRT)  collision  operator,  also  referred  to  as  the  gener¬ 
alized  lattice  Boltzmann  equation,  was  first  presented  in  [24].  Its  objectives  were  to  resolve 
the  fixed  Prandtl  number  defect  of  LBGK,  and  allow  for  varying  kinematic  and  bulk  vis¬ 
cosities  as  well  as  introduce  a  mechanism  for  increasing  simulation  stability.  The  MRT 
projects  the  density  distribution  functions  /„  onto  an  orthogonal  vector  space  of  momenta 
of  the  vector  space  using  the  operator  M.  The  particular  momenta  depend  on  the  lattice 
structure  chosen  but  all  include  a  combination  of  the  mass  density,  kinetic  energy,  mo¬ 
mentum  flux,  energy  flux  and  viscous  stress  tensor.  They  are  expressed  in  the  vector  R. 
Relaxation  occurs  over  the  momentum  space  using  the  relaxation  times  given  in  S  and  the 
result  is  transformed  back  to  the  density  space  fa  using  the  inverse  of  M. 

For  the  D2Q9  lattice,  the  momentum  space  and  transformation  matrix  are  given  in 
Equations  15  and  Equation  16,  respectively. 


R 


D2Q9 


lT 


P  G  €  Jx  Qx  Jy  Qy  Pxx  Qxy 


(15) 
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-4  -1  -1  -1  -1  2  2  2  2 

4-2-2-2-211  1  1 

0  1  0-101-1-11 

Md2Q9  =  0  -2  0  2  0  1  -1  -1  -1  (16) 

0  0  1  0-111-1-1 

0  0  -2  0  2  1  1  -1  -1 

0  1  -1  1  -1  0  0  0  0 

0  0  0  0  0  1  -1  1  -1 

where  R  =  M/q,  and  p  is  fluid  density,  e  is  the  energy,  e  is  related  to  the  square  of  the 
energy,  and  jy  are  mass  fluxes,  qx  and  qy  correspond  to  energy  flux  and  pxx  and  pxy 
correspond  to  the  diagonal  and  off-diagonal  components  of  the  viscous  stress  tensor.  The 
coefficients  for  relaxation  over  this  momentum  space  are  given  in  a  diagonal  matrix  as  in 
Equation  17. 

S  =  diag  (0,  S2,  S3,  0,  ss,  sj,  ss,  sg)  (17) 

In  [25]  it  was  shown  that  the  same  fluid  viscosity  is  given  in  the  fluid  flow  when  ss  =  sg  = 
K  The  other  parameters  in  Equation  17  can  be  set  as  desired  so  as  to  promote  solution 
stability  or  as  required  to  further  tailor  fluid  behavior.  If  all  non-zero  coefficients  of  S  are 
set  to  LBGK  single -relaxation  time  is  recovered.  Having  specified  the  coefficients  of  S, 
the  LBM  collision  operator  is  as  shown  in  Equation  18, 

^MRT  _  if  -  /"^)  (18) 

where  all  values  of  fa  are  relaxed  with  a  single  matrix  collision  operator.  In  the  software 
developed  for  this  work,  it  has  been  observed  that  use  of  MRT  significantly  promotes  simu¬ 
lation  stability.  While  the  MRT  requires  more  computations  per  time-step,  it  has  been  found 
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that  simulations  are  able  to  be  conducted  with  much  lower  lattice  density.  Furthermore, 
fewer  time  steps  are  typically  required  for  flows  to  overcome  the  noise  of  nonequilibrium 
initial  conditions  and  reach  accurate  flow  configurations. 

D.  BOUNDARY  CONDITIONS 

The  fluid  flow  problems  which  we  hope  to  solve  using  LBM  are  initial  boundary 
value  problems.  As  such,  the  handling  of  initial  and  boundary  conditions  should  be  central 
to  any  discourse  on  numerical  solution  methods. 

One  problem  with  LBM  is  that  the  physically  relevant  and  observable  macroscopic 
flow  features  such  as  velocity  and  pressure  are  not  the  dependent  variables  in  the  governing 
equation;  rather  they  are  functions  of  the  dependent  variables.  While  it  is  possible  to  find  a 
reasonable  set  of  /„  corresponding  to  a  particular  pressure  and  velocity  there  is  in  general 
no  unique  way  to  do  this. 

Despite  this  difficulty,  researchers  have  formulated  numerous  schemes  that  try  to 
view  the  boundary  lattice  points  in  a  manner  consistent  with  every  other  lattice  point  with 
the  exception  that,  for  certain  lattice  directions  there  is  no  updated  data  streamed  in  from 
the  previous  time  step.  This  condition  is  illustrated  schematically  in  Figure  3. 

The  boundary  condition  schemes  described  in  this  section  represent  answers  to  the 
dual  questions: 

1.  What  value  should  be  given  for  each  /„  for  which  there  is  not  a  corre¬ 
sponding  neighbor? 

2.  How  can  this  be  done  so  as  the  desired  macroscopic  boundary  condition 
will  be  enforced? 

The  boundary  conditions  discussed  in  this  section  answer  these  questions.  The  dis¬ 
cussion  will  not  survey  all  available  methods,  but  only  those  implemented  for  this  research. 
Each  method  will  be  examined  on  the  basis  of  stability  and  implementation  effectiveness. 
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West  Boundary 


—  —  —  Unknown 
'►  Known 


Figure  3:  Schematic  of  lattice  point  on  west  domain  boundary. 

1.  Periodic  Boundaries 

Periodic  boundary  conditions  are  a  common  and  easy  to  implement  boundary  con¬ 
dition  with  LBM.  For  nodes  along  a  periodic  boundary,  the  node  along  the  corresponding 
periodic  boundary  is  assigned  as  the  nearest  neighbor  for  streaming  purposes.  The  density 
distribution  for  that  direction  is  replaced  accordingly  as  is  shown  in  Figure  4. 

_ I _ 

before  streaming 

^  after  streaming  with  periodic  boundary' 


Figure  4:  Streaming  of  /2  across  a  North/South  periodic  boundary. 


15 


2. 


Solid  Boundaries 


Solid  boundaries  appear  in  a  wide  variety  of  applieations.  The  LBM  solution  to 
a  flat  no-slip  boundary  is  the  so-ealled  “bounee-baek”  boundary  eondition  in  whieh  all 
unknown  values  of  /„  are  replaeed  with  the  values  that  are  known,  but  from  the  opposite 
direetion.  Additionally,  direetions  parallel  to  the  solid  boundary  are  also  reversed,  resulting 
in  the  exehange  of  density  distribution  values  for  all  opposing  direetions.  This  is  illustrated 
in  Figure  5.  Solid  boundaries  implemented  in  this  fashion  are  often  referred  to  as  “dry- 
nodes”  beeause  they  do  not  undergo  the  eollision  proeess.  This  simplifies  implementation 
and  exeeution  effieieney  eonsiderably  sinee  maeroseopie  values  need  not  be  eomputed  at 
these  nodes  nor  must  the  equilibrium  distribution  be  evaluated.  This  so-ealled  “on-grid” 
version  of  the  bounee-baek  boundary  eonditions  has  been  shown  to  be  first-order  aeeurate 
in  [26]. 


Before  Bounceback 


After  Bounceback 


Boundary 


Figure  5:  Applieation  of  on-grid  bounee-baek  boundary  eondition. 


In  [27],  Ladd  introdueed  an  alternate  seheme  where  the  lattiee  points  are  arranged 
so  that  the  physieal  wall  is  aetually  loeated  exaetly  half-way  between  the  first  fluid  point 
inside  the  domain  and  the  eorresponding  solid  node  representing  the  wall.  This  seheme  is 
illustrated  in  Figure  6  and  has  been  shown  to  exhibit  seeond-order  eonvergenee. 
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Fluid  Lattice  Point 


Solid 

Boundary 


Solid  Lattice  Point 


Bounced  Back 
Directions 


Figure  6:  Half-way  bounceback  solid  boundary  condition  schematic. 

3.  Moving  Solid  Boundaries 

This  boundary  condition  seeks  to  apply  an  appropriate  redistribution  to  the  den¬ 
sity  distribution  to  achieve  a  prescribed  velocity  to  the  moving  solid  while  maintaining 
mass  conservation.  It  finds  practical  application  in  fluid-structure  interaction  problems  as 
described  in  [28]  as  well  as  pure  benchmark  problems  such  as  the  lid-driven  cavity. 

During  each  collision  step,  the  values  of  fa  are  modified  according  to  Equation  19. 

frw„e„  ■  (u,,  -  u)  ^ 

Due  to  the  symmetry  of  the  lattice  vectors  Oq,  and  weights  Wa,  the  total  density  at  the 
lattice  is  invariant  through  the  execution  of  Equation  19  but  the  distributions  are  adjusted 
to  achieve  the  prescribed  boundary  velocity  u^c. 

To  the  best  of  this  author’s  knowledge,  no  formal  analysis  has  been  done  regarding 
the  stability  or  accuracy  properties  of  this  procedure.  As  a  heuristic  method  for  achieving 
a  desired  momentum  input  to  the  fluid  system  under  simulation  while  maintaining  conser¬ 
vation  of  mass,  it  is  very  appealing.  It  is  generally  formulated  for  any  lattice  structure  or 
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location  within  the  domain  and  in  fluid  simulations  where  it  has  been  used  for  this  work,  it 
has  shown  excellent  stability  properties.  As  for  accuracy,  it  was  used  in  the  lid-driven  cav¬ 
ity  benchmark  discussed  in  Chapter  III  where  it  is  shown  to  allow  for  excellent  agreement 
with  both  experimental  and  computational  data  reported  in  the  literature. 

4.  Prescribed  Velocity  or  Pressure  Boundaries 

Prescribed  velocity  and  pressure  boundary  conditions  constitute  an  indispensable 
tool  for  fluid  modeling  problems.  As  with  other  boundary  condition  schemes,  the  methods 
to  be  described  in  this  section  all  seek  to  assign  suitable  values  to  fa  for  lattice  points 
along  a  boundary  so  that  the  desired  macroscopic  conditions  are  realized.  An  excellent 
review  paper  can  be  found  at  [29]  that  formally  analyzes  several  methods.  Details  included 
in  this  section  are  drawn  largely  from  this  reference.  A  common  theme  among  boundary 
conditions  of  this  type  is  that  specification  of  either  density — which  in  the  LBM  framework 
is  equivalent  to  pressure  by  using  Equation  13 — or  velocity,  the  other  macroscopic  variable 
can  be  determined  based  solely  on  the  known  values  of  fa- 

Referring  to  Figure  7,  the  contributors  to  the  macroscopic  density  at  a  lattice  point 
can  be  grouped  into  three  categories:  those  stationary  or  parallel  to  the  boundary  po-  fo,  f 2, 
and  /4  in  this  case;  those  known  density  distributions  pointed  into  the  boundary  -  /3,/6, 
and  fj;  and  those  pointed  out  from  the  boundary  p_  -  /i,/5  and  /g.  Every  straight  boundary 
will  have  this  grouping.  Comparing  this  with  Equation  1 1,  it  can  be  seen  that  Equation  20  is 
an  identity.  Additionally,  considering  the  lattice  velocities  it  is  easy  to  demonstrate  that  the 
velocity  component  perpendicular  to  the  straight  boundary  can  be  expressed  as  in  Equation 
21. 


P  =  P-+  Po  +  p+ 

(20) 

pu±  =  P+-  p- 

(21) 

Given  Equations  20  and  21,  given  either  the  value  for  the  velocity  component  into  the 
domain,  u±  or  p,  it  is  always  possible  to  determine  the  other  using  only  the  density  distri¬ 
bution  components  that  are  known  after  streaming  -  po  and  p+.  These  relations  are  given  in 
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West  Boundary 


►  Known 
=  /l  +  /s  +  /s 

P^  =  fi+  fi 

Po=  Ia 


Figure  7:  Groups  of  density  distributions  on  west  boundary  lattice  point. 


Equations  22  and  23. 


P 


1 

l  +  u± 


(2p+  +  Po) 


(22) 


1  ,  (2p+  +  Po) 
=  -1 H - 

P 


(23) 


Once  the  value  for  p  and  u  are  known  on  the  boundary,  this  information  can  be  used  to 
judiciously  assign  values  either  to  the  unknown  density  distributions  as  in  the  Zou-He  type 
boundary  conditions  [30]  or,  in  the  case  of  the  regularized  boundary  conditions  introduced 
in  [31],  to  all  distributions. 


a.  Zou-He  Boundaries 

The  Zou-He  scheme  for  prescribed  pressure  or  velocity  seek  to  find  suitable 
values  only  for  the  unknown  density  distributions  at  the  boundary  lattice  point.  For  this  ex¬ 
ample,  the  case  of  a  prescribed- velocity  boundary  condition  on  the  West  domain  boundary 
as  depicted  in  Figure  7  will  be  used. 
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For  the  condition  depicted,  when  accounting  for  the  prescribed  velocity  on 
the  boundary,  there  remain  four  unknowns:  p,  /i,  /s  and  /g.  Balanced  against  these  four 
unknowns  is  the  known  relation  for  density  given  in  Equation  11,  and  two  equations  for 
momentum  given  by  Equation  12.  In  order  to  provide  closure,  a  fourth  relationship  is  nec¬ 
essary.  The  Zou-He  boundary  conditions  develop  this  relationship  by  assuming  “bounce- 
back”  of  the  non-equilibrium  part  of  the  density  distribution  directed  perpendicular  to  the 
boundary.  In  this  case,  this  gives  us  Equation  24. 


fi  -  ir  =  h-  fP  (24) 

Applying  Equation  10  along  with  the  known  p  and  u  and  appropriate  lattice  vectors  ei  and 
es  and  weights  tui,  tus,  the  above  relation  simplifies  to  Equation  25 


/i  =  /s  +  ‘^pUx  (25) 

In  two  dimensions,  this  provides  closure  and  values  of  the  remaining  unknown  density 
distribution  functions  can  be  determined.  Eor  this  example,  the  expressions  are  given  as 
Equation  26  and  Equation  27. 


/s  —  A  +  2  (A  ~  A)  +  ^P'^y  qP"^^ 


(26) 


/s  =  A  -  2  (A  -  A)  -  ^PUy  +  -pUx  (27) 

Eor  prescribed  pressure,  the  procedure  is  the  same,  except  that  given  p,  we  solve  for  u±  - 
in  this  case  u^. 

Eor  three  dimensions,  there  are  still  more  unknowns;  using  the  D3Q15  lat¬ 
tice  and  applying  a  boundary  condition  to  the  west  domain  boundary.  In  addition  to  one  of 
p  or  u±,  there  are  five  density  distributions  that  are  unknown:  A jA^AAii  ^nd  As-  In  the 
case  of  the  D3Q27  lattice,  there  are  nine  unknown  density  distribution  functions. 
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The  idea  demonstrated  in  [32]  is  to  apply  the  non-equilibrium  bounce-back 
as  used  in  Equation  24  to  all  unknown  density  distribution  values  relative  to  the  known 
density  distribution  in  the  opposite  direction.  If  we  adopt  the  convention  that  for  a  given 
density  distribution  f^,  fj.  connotes  the  density  distribution  traveling  in  the  opposite  direc¬ 
tion,  and  fk  is  an  unknown  density  distribution,  this  is  shown  in  Equation  28  for  the  D3Q15 
lattice. 

fk  =  h+  (/r  -  /r)  ,ke{l,  7, 9, 11, 13}  (28) 

In  order  to  correct  for  momenta  in  the  plane  of  the  boundary — for  this  example,  in  the  y  and 
2:  directions — an  adjustment  is  applied  to  the  density  distributions  that  are  not  perpendicular 
to  the  boundary,  which  is  shown  in  Equation  29  for  the  D3Q15  lattice. 

fk  =  fk  +  ^  [^ky  (/s  ~  fi)  +  ^kz  (/s  ~  /e)]  ,  k  ^  {7,9, 11, 13}  (29) 

Putting  this  all  together,  we  arrive  at  Equations  30  and  31. 

/l  =  /2  +  ‘^PinUx  (30) 

fk  =  fk  +  ^pinUx-^[eky{fs-f4)  +  ekz{f5-f6)]  , /c  G  {7,  9,  11,  13}  (31) 

Eor  other  lattice  types  and  other  boundaries,  the  same  general  prescription  is  followed. 
b.  Regularized  Boundaries 

Eor  regularized  boundary  conditions,  introduced  in  [31]  and  discussed  in 
more  detail  in  [29],  all  particle  distribution  functions  on  boundary  lattice  points  are  replaced 
based  on  values  of  p,  u  or  that  are  either  specified  by  the  boundary  condition  or 
computed  based  on  known  particle  distribution  values. 
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In  the  same  fashion  as  with  the  Zou-He  boundary  condition,  given  either 
p  ox  u^,  the  other  macroscopic  variable  can  be  determined  based  on  the  known  values 
of  fa-  Once  this  is  complete,  is  computed  using  Equation  10.  Values  for  the  unknown 
density  distribution  function — — are  initially  estimated  based  on  bounce-back  of  the  non¬ 
equilibrium  parts  as  in  Equation  28.  Since  we  cannot  know  the  non-equilibrium  portion  of 
fk,  we  simply  assign  it  to  have  the  same  non-equilibrium  component  as  /j  as  shown  in 
Equation  32. 


fk  =  /r +h-  /f  (32) 

The  tensor  11  is  the  second-order  moment  of  the  particle  density  populations  and  can  be 
expressed  as  in  Equation  33. 

q-l 

n  =  ^  BaOafa  (33) 

q;=0 

Equation  34  is  used  to  reconstruct  hydrodynamically  consistent  values  for  all  fa  on  the 
boundary  lattice  point. 


111 

fa  =  rj  +  :  n  (34) 

where  Qa  =  0^60,  —  c^I,  I  being  the  identity  matrix.  This  method  for  boundary  condition 
application  is  appealing  for  its  generality.  The  superior  stability  properties  of  this  method 
is  discussed  at  length  in  [29]  and  [31]. 

E.  BODY  FORCES 

Many  physical  simulations  require  the  application  of  a  body  force.  Some  simple 
examples  would  be  a  simulation  that  includes  gravity;  a  simulation  with  an  imposed  dif¬ 
ferential  pressure,  where  the  pressure  is  included  as  a  body  force  on  the  fluid  particles;  or 
a  multi-component  model  where  the  interaction  between  particles  of  different  species  are 
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modeled  by  nearest- neighbor  force  inputs.  The  most  common  way  to  incorporate  these 
forces  is  via  an  adjustment  to  the  equilibrium  velocity  as  given  in  Equation  35, 

=  u  +  Au  (35) 

where  Au  is  given  by  Equation  36. 

tF 

Au  =  —  (36) 

P 

Equation  36  can  be  understood  heuristically  by  considering  F  =  ma  =  with 
the  relaxation  parameter  r  taking  the  role  of  the  differential  in  time. 

F.  SCALING 

The  goal  of  any  numeric  simulation  is  to  obtain  quantitative  and  qualitative  results 
that  can  be  applied  to  a  particular  physical  system  of  interest.  Much  EBM  literature  is  cast 
in  “EBM  units”  where  the  distance  between  lattice  points  and  the  time  for  each  time  step  is 
unity.  This  presents  a  clean  palate  on  which  to  develop  the  theory,  but  leaves  out  the  crucial 
details  of  how  to  tailor  EBM  simulation  parameters  so  that  the  results  can  be  related  to  a 
particular  set  of  fluid  conditions. 

In  previous  sections,  the  equations  relevant  for  staging  and  executing  a  EBM  simu¬ 
lation  were  presented  entirely  in  lattice  units — where  every  time  step  is  of  unit  length  and 
the  distance  between  any  adjacent  lattice  sites  is  also  of  unit  length.  Since  this  basic  system 
of  units  is  generally  unsuitable  for  physical  problems,  basic  physical  parameters  given  in 
some  units  of  length  and  time  must  be  re-scaled  consistently  so  that  these  parameters  can 
be  converted  into  units  suitable  for  incorporation  into  the  EBM  algorithm. 

As  an  intermediate  step,  it  is  sometimes  customary  to  rescale  physical  units  to  non- 
dimensional  units.  This  is  particularly  useful  in  cases  where  knowledge  of  the  system  state 
in  terms  of  some  non-dimensional  parameters  such  as  the  Reynolds  number  is  needed. 
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The  nondimensionalization  and  scaling  scheme  employed  for  this  research  is  illustrated  in 
Figure  8. 
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Figure  8:  Scaling  from  physical  units,  to  dimensionless  units  to  LBM  units. 


G.  EXAMPLE 

In  an  attempt  to  clarify  the  discussion  from  this  chapter,  the  LBM  formulation  of  a 
model  problem  will  be  discussed  in  detail. 

1.  Problem  Description 

The  procedures  discussed  in  this  chapter  are  summarized  in  the  flowchart  appearing 
in  Figure  10.  The  problem  to  be  considered  is  illustrated  schematically  in  Figure  9.  The 
problem  involves  flow  within  a  two-dimensional  channel  around  a  cylindrical  obstacle. 
Flow  enters  from  the  left  boundary  with  a  prescribed  parabolic  velocity  and  exits  out  the 
right  boundary  with  a  prescribed  constant  pressure.  The  top  and  bottom  boundary  are 
modeled  as  no-slip  walls. 

2.  Scaling  and  Setup 

The  process  of  scaling  for  this  example  problem  will  be  completed  in  two  steps  as 
described  in  the  section  on  scaling.  First,  a  characteristic  time  scale  Tq  and  length  scale 
Lq  will  be  identified.  For  this  problem  of  a  cylindrical  obstruction  in  two  dimensional 
channel  flow,  the  natural  choice  is  to  use  the  conventions  for  Reynolds  scaling  where  the 
characteristic  length  is  the  diameter  of  the  cylinder.  Therefore,  the  characteristic  length  in 
physical  units  L^p  =  0.2  m.  The  characteristic  time  is  assigned  to  be  the  time  required 
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Figure  9:  Schematic  diagram  of  channel  flow  example  problem. 


for  an  average  fluid  particle  to  traverse  the  diameter  of  the  cylinder.  For  the  assigned  inlet 
boundary  condition,  the  average  fluid  velocity  is  two-thirds  of  the  maximum  inlet  velocity 
of  0.5  ^  for  the  parabolic  profile.  Consequently,  To,p  =  =  ||  =  0.4  sec  All  of  this 

corresponds  to  a  Reynolds  number  of  100,  which  is  convenient  to  know  when  comparing 
the  output  of  the  LBM  simulation  against  experimental  data  or  benchmark  values. 

The  second  step  is  to  decide  how  finely  the  reference  time  and  space  scales  are 
to  be  subdivided.  For  this  example,  the  reference  length  Lq  will  be  represented  with  25 
lattice  points,  so  there  are  intervals  in  the  reference  length.  In  terms  of  dimensionless  units, 
Lq  =  1.  The  conversion  between  dimensionless  units  and  the  LBM  units  is  therefore: 
Llbm  =  =  25^  =  0.0417.  To  convert  between  a  distance  in  terms  of  lattice  units  and 

a  distance  in  physical  units,  one  would  multiply  by  both  the  conversion  factors.  Therefore, 
the  physical  spacing  of  the  lattice  points  is  5^  x  L^p  =  0.0083  m.  Similarly  the  time 
domain  is  discretized  by  deciding  how  many  time  steps  will  be  used  to  traverse  a  single 
unit  of  the  reference  time  Tq.  For  this  problem,  the  reference  time  will  be  divided  by 
250  time  steps,  so  =  ^  =  ^  =  0.004  seconds.  As  with  the  spatial  scaling,  in 
order  to  convert  a  single  LBM  time  step  to  physical  elapsed  time,  one  must  multiply  by 


the  scaling  parameters  between  the  Dimensionless  units  and  LBM  units  in  addition  to  the 
conversion  between  Physical  and  Dimensionless  units.  For  this  problem,  those  conversions 
are  Tp  =  Tlbm  x  x  Tq  =  0.0016. 

Once  the  spatial  and  temporal  scaling  factors  are  determined,  the  properly  scaled 
LBM  parameters  must  be  determined  from  the  given  physical  data. 

a.  Viscosity  Scaling 

In  order  to  get  dynamic  LBM  behavior  that  corresponds  to  the  desired  phys¬ 
ical  fluid  under  the  prescribed  conditions,  the  temporal  and  spatial  scaling  factors  need  to 

2 

be  applied  to  the  specified  fluid  kinematic  viscosity  v.  Since  v  has  units  of  ^  the  necessary 
conversion  can  be  accomplished  via  Equation  37. 

^LBM  ^Physical  \^ ' ) 

\LqOx) 

Carrying  out  this  conversion  for  the  specified  fluid  with  the  chosen  dis¬ 
cretization  results  in  z/lbm  =  0.023.  This  is  the  value  that  is  applied  to  Equation  14  to 
find  the  EBM  relaxation  parameter.  Doing  so  for  this  problem  results  in  r  =  0.57  or 
w  =  1.76 


b.  Velocity  BC  Scaling 

This  problem  has  a  prescribed  inlet  velocity  profile.  The  velocity  is  ex¬ 
pressed  in  terms  of  meters-per-second,  which  is  not  compatible  with  the  unit  system  as¬ 
sumed  when  the  EBM  boundary  conditions  were  developed.  The  velocity  is  simply  scaled 
in  accordance  with  Equation  38. 


^LBM  —  ^Physical  ^  t  ^ 

J^qOx 


Eor  this  problem,  this  conversion  reads: 


u{0,y) 


3  L  (I/-0-5) 

4  0.5 


0.0016 

0.0083 


0.144 


{y  -  0-5) 
0.5 


(38) 
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This  is  the  velocity  that  will  be  passed  to  the  LBM  time- stepping  routine  in  order  to  set  the 
prescribed  velocity  at  the  inlet  lattice  points. 

c.  Pressure  BC  Scaling 

For  this  problem,  the  prescribed  outlet  pressure  is  set  to  0  Pa.  This  is  a 
relative  pressure,  of  course,  otherwise  the  density  for  lattice  points  at  the  outlet  would 
need  to  be  set  to  zero  according  to  Equation  13.  Since  the  prescribed  pressure  boundary 
condition  procedures  are  actually  methods  for  enforcing  a  specific  density  at  the  boundary, 
in  order  to  employ  the  pressure  boundary  condition  for  evaluating  pressure  on  the  domain 
we  must: 

•  Compute  the  pressure  through  the  domain  using  Equation  11  and  Equation  13  along 
with  the  value  of  Cs  applicable  for  the  lattice  in  use. 

•  Scale  the  computed  pressure  to  physical  units  using  Equation  39.  In  this  step,  one  is 
simply  converting  cf  to  physical  units. 

•  Adjust  the  pressure  in  physical  units  so  that  the  boundary  condition  is  satisfied  as  in 
Equation  40. 


ry  _  2  ( 

Physical  PlEMC*  I 

\  Ot-i-o 


(39) 


P 


P  —  P 

^  Physical  ^  Boundary 


(40) 


3.  Initialization  and  Lattice  Point  Classification 

As  final  steps  before  commencing  the  EBM  time-stepping  routine,  the  initial  values 
for  fa  must  be  established  for  all  lattice  points.  Additionally,  all  boundary  lattice  points, 
solid  lattice  points,  and  any  other  type  of  lattice  point  that  will  require  special  treatment 
must  be  identified.  Eor  this  problem,  we  need  only  identify  inlet  nodes  for  the  prescribed 
velocity  boundary  condition,  outlet  nodes  for  the  pressure  boundary  condition  as  well  as 
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solid  nodes  for  the  top  and  bottom  boundary  as  well  as  the  cylindrical  obstruction.  Each  of 
these  classes  of  lattice  points  will  be  treated  with  distinction  during  each  time-step. 

There  is  no  solidly  established  means  for  establishing  an  initial  condition.  A  com¬ 
mon  choice  for  many  problems  is  to  simply  set  the  initial  set  of  fa  for  each  lattice  point 
equal  to  the  equilibrium  density  distribution  as  computed  with  Equation  10  to  some  pre¬ 
determined  velocity  and  density  distribution.  This  is  shown  in  Equation  41  for  the  D2Q9 
lattice. 


fa  (X,  0)  =  =  pWc 


9  2  3 

1  -f  3  (e«  ■  uo)  -f  -  (Oo  ■  uo)  -  -  (uo  ■  uq) 


(41) 


Similarly,  there  is  no  standardized  procedure  for  generating  the  lattice  domain, 
identifying  inlet  and  outlet  lattice  points  or  lattice  points  along  solid  boundaries.  General- 
purpose,  Open-source  lattice-generating  software  that  could  carry  out  tasks  such  as  these 
on  more  complex  geometries  do  not  seem  to  exist.  Eor  problems  with  simple  geometry,  as 
this  example  problem  does,  this  task  can  be  executed  quite  efficiently  with  simple  searches 
based  on  lattice  point  geometric  position,  (e.g,  all  of  the  inlet  lattice  points  can  simply  be 
found  by  identifying  all  of  those  points  that  lie  along  a;  =  0  and  where  y  ^  0  and  y  ^  1. 

4.  Time-Stepping 

The  basic  time-stepping  scheme  is  illustrated  in  the  flowchart  shown  in  Eigure  10. 
Eluid  nodes  not  on  a  boundary  compute  values  for  macroscopic  density  and  velocity  using 
Equation  11  and  12.  These  values  are  used  to  compute  fff  using  Equation  10.  Using 
the  scaled  relaxation  parameter  computed  using  Equations  37  and  14,  perform  the  BGK 
relaxation  using  Equation  9.  A  visualization  of  the  results  after  50,000  time  steps  are 
shown  in  Eigure  11.  This  represents  approximately  80  seconds  of  physical  time  according 
to  our  time  scaling  factor  computed  above. 
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Apply  Microscopic 
and  Macroscopic 
Boundary  Conditions 


Figure  11:  Velocity  magnitude,  pressure,  and  vorticity  magnitude  for  example  flow  case 
after  50,000  time  steps. 
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III.  IMPLEMENTATION  AND  VALIDATION 


In  order  to  model  fluid  structure  interactions  with  LBM  and  FEM,  it  is  necessary 
to  have  available  a  reliable,  flexible  and  powerful  implementation  of  the  LBM.  It  must 
first  be  reliable  so  that  we  can  have  some  reasonable  hope  that  the  results  obtained  will  be 
comparable  to  physical  reality.  It  must  be  flexible  so  that  a  wide  variety  of  simulations  can 
be  conducted,  pertaining  to  different  physical  conditions  of  interest.  It  must  be  powerful  so 
that  these  simulations  can  be  refined  for  greater  accuracy  while  still  able  to  be  completed 
in  a  reasonable  amount  of  time. 

In  this  chapter,  a  body  of  software  tools,  designed  using  the  theory  presented  in 
Chapter  II,  is  tested  against  a  selection  of  standard  benchmarks  first  for  accuracy  and  then 
for  performance.  The  following  results  are  obtained  from  this  work: 

•  The  LBM  software  as  implemented  for  this  work  reliably  reproduces  re¬ 
sults  obtained  Poiseuille  flow,  lid-driven  cavity,  flow  over  a  backward  step 
and  flow  over  a  cylindrical  obstruction  for  2D. 

•  The  LBM  provides  second-order  convergence  to  the  2D  Poiseuille  flow 
with  appropriate  boundary  conditions  and  double  precision  arithmetic.  A 
mixed-precision  algorithm  is  introduced  which  allows  similar  second-order 
convergence  while  only  storing  single-precision  data. 

•  The  LBM  is  benchmarked  for  performance  against  recently  published  im¬ 
plementations  for  3D  lid-driven  cavity  flow.  It  is  shown  that  the  software 
produced  for  this  work  has  competitive  performance  with  other  single¬ 
workstation  implementations  recently  produced. 

A.  POISEUILLE  FLOW 

The  first  test  case  is  a  two-dimensional  flow  case  between  parallel  plates.  The  flow 
condition  is  depicted  in  Figure  12. 
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Figure  12:  Poiseuille  flow  configuration. 


Channel  Width  (2  x  h) 

1.0  m 

fluid  density  (p) 

1000  ^ 

fluid  viscosity  (p) 

1  ^ 

Maximum  inlet  velocity  ((/max) 

0.015 

s 

Table  1 :  Geometry  and  fluid  parameters  for  Poiseuille  flow  test  case. 


The  solution  to  this  problem  is  known  to  be  a  function  of  y  only  and  is  given  in 
Equation  42: 

with  ^  given  as  a  function  of  maximum  velocity  and  fluid  viscosity  in  Equation  43 


dp  2/i 

dx  W' 


(43) 


The  maximum  velocity  is  set  by  the  inlet  boundary  condition.  Specific  fluid  and  flow 
conditions  are  presented  in  Table  1 . 
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1.  Solution  with  On-Grid  Bounceback  Boundary  Conditions 

For  the  LBM  model  of  this  problem,  the  D2Q9  lattice  with  LBGK  dynamics  was 
used  along  with  Zou/He  boundary  conditions  for  the  prescribed  velocity  on  the  west  bound¬ 
ary  and  constant  prescribed  pressure  on  the  east  boundary.  The  initial  lattice  discretization 
was  set  so  that  30  lattice  points  would  span  the  channel  entrance.  The  time  step  was  set  to 
achieve  a  relaxation  parameter  u  of  1.30.  While  refining  the  grid  to  test  for  convergence, 
the  time  step  was  adjusted  so  as  to  maintain  a  constant  relaxation  parameter  for  all  tests. 
The  results  are  shown  in  Figure  13.  As  expected,  first-order  convergence  is  obtained  for 
this  style  of  boundary  condition. 

2.  Solution  with  Half-Way  Bounceback  Boundary  Conditions 

The  half-way  bounce-back  boundary  condition  was  implemented  and  used  in  an 
identical  set  of  tests.  The  goal  of  this  step,  in  addition  to  showing  the  convergence  proper¬ 
ties  of  the  boundary  condition,  is  to  illustrate  the  second-order  convergence  properties  of 
the  LBM  as  a  whole.  Results  for  single  precision  are  shown  in  Figure  14.  It  is  clear  from  the 
figure  that,  for  problems  with  modest  accuracy  and  comparatively  coarse  lattice  densities, 
second-order  convergence  is  obtained.  For  more  refined  lattices,  however,  the  expected 
convergence  rate  is  lost  in  single-precision.  To  investigate  the  effect  of  the  numerical  pre¬ 
cision  in  which  the  software  is  written,  an  alternate  implementation  was  generated  utilizing 
double  precision  arithmetic  for  all  LBM  calculations.  Results  of  this  convergence  test  are 
shown  in  Figure  15.  In  order  to  avoid  the  cost  of  double  precision  computations  a  mixed 
precision  LBM  kernel  was  developed.  Through  an  experimental  analysis  of  the  sources 
of  error  in  the  LBM  computations,  it  was  found  that  numerical  results  nearly  identical  to 
that  obtained  with  full  double  precision  computations  could  be  obtained  by  conducting 
only  computation  of  and  collisions  in  double  precision.  Convergence  results  for  this 
computation  are  shown  in  Figure  16. 

The  mixed  precision  brings  the  accuracy  of  double  precision  with  a  lower  cost  for 
memory  consumption  and  with  better  performance  than  a  pure  double  precision  compu¬ 
tation.  The  memory  cost  for  working  in  double  precision  is  simply  twice  that  of  single 
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Figure  13:  Poiseuille  flow  convergence  with  On-Grid  bounce-back  boundary  conditions. 
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Log  Grid  Resolution 


Figure  14:  Poiseuille  flow  convergence  with  half-way  bounce-back  boundary  conditions  in 
single  precision. 


Figure  15:  Poiseuille  flow  convergence  with  half-way  bounce-back  boundary  conditions  in 
double  precision. 
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Half-Way  Mixed  Precision 


Figure  16:  Poiseuille  flow  convergence  with  half-way  bounce-back  boundary  conditions 
using  mixed-precision  arithmetic. 

precision.  The  relative  computational  performance  of  single,  mixed  and  double  precision 
are  presented  in  Figure  17  for  two-dimensional  Poiseuille  flow  using  the  LBGK  collision 
operator.  For  less  refined  lattices,  the  double  precision  performance  is  nearly  identical  to 
single  precision  and  both  are  slightly  higher  than  mixed  precision.  For  more  dense  lattices 
the  additional  memory-bandwidth  load  of  passing  double  precision  operands  to  the  com¬ 
putational  routines  becomes  more  important  than  penalties  paid  for  type  conversions  and 
mixed  precision  outperforms  double  precision. 

3.  Stability  and  Accuracy 

In  the  preceding  section,  it  may  have  seemed  arbitrary  to  have  selected  a  constant 
relaxation  parameter  oo  =  1.25.  This  conclusion  is  partially  correct  insofar  as  there  is 
considerable  flexibility  as  to  how  this  value  is  picked.  We  recall  from  Chapter  II  that, 
including  effects  of  scaling  in  time  and  space,  the  fluid  viscosity  in  LBM-units  scales  by  U 
in  accordance  with  Equation  37.  Consequently,  if  6^  is  reduced  by  a  factor  of  2,  5t  must  be 
reduced  by  a  factor  of  4.  With  this  refined  time  step,  the  number  of  time  steps  is  increased 
by  a  factor  of  4  for  the  fluid  simulation,  including  the  time  required  for  the  LBM  simulation 
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Performance  vs.  Lattice  Refinement 


Figure  17:  Relative  performance  of  single  precision  (SP),  mixed  precision  (MP)  and  double 
precision  (DP)  computational  routines  for  Poiseuille  flow.  The  lattice  refinement  parameter 
refers  to  the  number  of  lattice  points  placed  across  in  the  dimension  of  the  channel  opening. 

to  arrive  at  an  equilibrium  from  non-equilibrium  initial  conditions.'  For  a  given  value  of  oj, 
this  initial  instability  can  last  for  many  time  steps.  Even  for  this  simple  problem  geometry, 
the  LBM  system  does  not  reach  a  stable  answer  for  many  time  steps.  An  illustration  of  this 
is  given  in  Figure  18.  This  figure  shows  variation  in  the  horizontal  velocity  at  the  center  of 
the  channel  geometry  for  lattice  density  of  Ny  =  30  and  480,  respectively.  Note  the  change 
in  time  scales  for  the  time-step  axis.  A  more  detailed  and  comprehensive  report  of  the 
stability  properties  of  LBM  along  with  a  comparison  between  LBGK  and  MRT  collision 
operators  can  be  found  in  [33]. 


'in  this  case,  transition  from  zero  velocity  everywhere  in  the  domain,  to  the  parabolic  velocity  profile 
that  is  the  solution. 
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Horizontal  Velocity  Fluctuation  x/Lx  =  0.75  vs  Time  Step,  Re  =  10 


X  10 


Horizontal  Velocity  Fluctuation  x/Lx  =  0.75  vs  Time  Step,  Re  =  10 


Time  Step  s 

X  10 

Figure  18:  Stabilization  time  for  Poiseuille  flow,  Re=10,  ^  =  oj  =  1.3.  Top  figure,  Ny=30, 
bottom  figure,  Ny=480. 
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B.  BACKWARD  FACING  STEP 


The  occurrence  of  flow  separation  of  internal  flows  by  sudden  geometric  changes 
is  well  known  and  is  important  to  engineering  applications;  FSI  in  particular.  While  the 
Poiseuille  flow  test  case  is  convenient  for  code  validation  insofar  as  analytic  solutions  are 
known,  it  does  not  fully  test  the  ability  of  the  LBM  to  reproduce  correct  fluid  behavior. 
It  will  be  shown  that  for  modest  Reynolds  numbers  typical  for  those  that  will  be  used  for 
the  FSI  applications  in  this  dissertation,  the  LBM  captures  flow  separation  typified  by  the 
backward  facing  step  problem  accurately. 

For  this  benchmark  study,  the  experimental  results  presented  in  [34]  will  be  used. 
The  main  benchmark  result  against  which  the  LBM  will  be  tested  is  illustrated  in  Figure 
19.  The  problem  set-up  is  as  depicted  in  Figure  20.  For  these  computations,  the  inlet 
boundary  conditions  are  prescribed  velocity  and  outlet  is  prescribed  pressure;  both  of  the 
Zou/He  type.  The  MRT  scheme  for  D2Q9  is  used  for  bulk  dynamics.  Measurements  were 
taken  based  on  streamlines  computed  from  the  velocity  data  by  Paraview.  An  example  for 
Reynolds  number  100  is  provided  in  Figure  21. 

In  order  to  conveniently  compare  with  measured  benchmark  values,  representative 
data  points  are  pulled  from  Figure  19  and  plotted  separately  along  with  the  values  taken 
from  computed  data.  For  each  successively  increased  Reynolds  number,  the  lattice  density 
and  time  step  were  both  adjusted  so  as  to  maintain  a  constant  relaxation  factor  u  =  1.25. 
Results  are  given  in  Figure  22.  Good  agreement  can  be  seen  in  this  case  with  experimental 
results. 
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Figure  19:  Backward  step  flow  separation  behavior.  Image  taken  from  [34] 
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p(,L,  y,  t)=‘Pg^=  Constant 
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Figure  20:  Schematic  of  domain  and  boundary  conditions  for  Backward-Step  benchmark 
in  2D. 
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Backward  Step,  Re=TOO 


Figure  21:  Backward-Step  simulation.  Step  height  =  0.25m,  outlet  width=0.5m,  Re=100. 


Figure  22:  Comparison  of  primary  vortex  re-attachment  length  normalized  by  step  height 
with  results  reported  in  [35]. 
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c. 


LID-DRIVEN  CAVITY 


As  one  validation  of  the  LBM  implementation,  the  software  was  used  to  simulate 
lid-driven  flow  in  two  dimensions.  A  eommonly  used  benchmark  for  this  flow  condition  is 
given  in  [35].  A  schematic  illustration  of  the  two-dimensional  lid-driven  cavity  problem  is 
given  in  Figure  23. 


u(_x,  t)  =  constant  = 

- > 


No-Slip  Boundary 

wwww 


Figure  23:  Schematic  of  the  two-dimensional  lid-driven  cavity  problem. 

Boundary  conditions  used  for  this  project  are  on-grid  bounce-back  to  model  no-slip 
stationary  walls.  The  moving-wall  boundary  condition  is  used  to  model  the  lid.  The  lid 
velocity  is  set  to  a  constant  value  in  the  x  direction  with  the  desired  speed.  The  standard 
benchmark  stipulates  a  Reynolds  number  of  1000,  though  other  authors  have  published 
results  at  higher  Re.  In  two  dimensions,  the  D2Q9  lattice  is  used  with  either  LBGK  or 
MRT  bulk  dynamics. 

In  Figure  24,  a  qualitative  comparison  is  made  with  results  reported  in  [36].  In  Fig¬ 
ure  25,  a  qualitative  comparison  is  made  of  the  streamlines,  pressure  contours  and  vorticity 
contours  with  those  published  in  [37].  Quantitative  comparisons  are  made  in  the  following 
figures. 
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Figure  24:  Lid-driven  cavity  in  two  dimensions  with  1600x1600  lattice  showing  from  left- 
to-right  streamlines,  vorticity  contours  and  pressure  contours  for  Re=1000.  Top  set  of 
figures  is  LBM  from  this  work.  Bottom  set  of  figures  is  from  [36]. 


Figure  25:  Lid-driven  cavity  in  two  dimensions  with  1600x1600  lattice  showing  from  left- 
to-right  streamlines,  vorticity  contours  and  pressure  contours  for  Re=5000.  Top  set  of 
figures  is  LBM,  bottom  set  of  figures  is  from  [37]. 
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In  Figure  26,  two  samples  of  velocity  are  taken  in  the  computed  problem  domain. 
The  first  is  the  x  velocity  component  sampled  along  the  vertical  center-line.  As  can  be  seen, 
good  agreement  with  benchmark  values  is  obtained  for  all  macroscopic  fluid  parameters. 
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X-velocity  Comparison  to  Benchmark 


Y-Velocity  Comparison  to  Benchmark 


Pressure  Profile  Horizontal  Slice 


Pressure  Profile  Vertical  Slice 
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Vorticity  Comparison  Horizontal  Slice 
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Vorticity  Comparison  Vertical  Slice 


Vorticity 


Figure  26:  Comparison  of  velocity,  pressure  and  vorticity  to  benchmark  values  for 
Re=1000. 
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D.  CHANNEL  FLOW  OVER  CYLINDER 


The  flow  conflguration  for  this  benchmark  is  illustrated  in  Figure  27.  Constant 
velocity  is  specified  on  the  inlet  and  constant  pressure  is  specified  on  the  outlet.  The  top 
and  bottom  of  the  domain  are  simulated  as  periodic  boundaries  with  the  result  that  the 
effective  domain  is  a  linear  array  of  cylinders  in  uniform  flow.  Similar  work  cited  for  the 
benchmarks  are  published  in  [3  8] -[44].  The  trailing  vortex  region  for  this  benchmark  was 
measured  visually  following  flow  simulation.  Numeric  results  are  given  in  Table  2  and 
graphically  in  Figure  28.  It  can  be  seen  from  the  table  that  the  computed  values  using  LBM 
are  comparable  to  those  reported  in  the  literature. 
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Figure  27:  Channel  with  cylindrical  obstacle  2D  problem. 


Author 

Re  =  20 

Re  =  40 

Zhou  (2012) 

0.92 

2.20 

Calhoun  (2002) 

0.91 

2.18 

Rusell  (2003) 

0.94 

2.35 

Silva  (2003) 

1.04 

2.55 

This  work 

0.95 

2.05 

Table  2:  Comparison  of  trailing  vortex  length  to  benchmark  values. 
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Figure  28:  Streamline  visualization  of  trailing  vortex  at  Re=20  (top)  and  Re=40  (bottom). 


Above  a  Reynolds  number  of  approximately  45,  the  trailing  vortex  detaehes  in  an 
alternating  pattern  referred  to  as  a  Von  Karman  street.  As  a  last  measure,  the  rate  of  vortex 
shedding  was  measured  and  the  non-dimensional  Strouhal  number  was  evaluated  with  the 
result  eompared  to  the  literature.  A  visualization  of  the  vortieity  in  the  wake  of  a  eireular 
ey Under  in  ehannel  flow  at  Reynolds  number  of  100  is  given  in  Figure  29.  Notiee  the 
alternating  regions  of  positive  and  negative  vortieity  resulting  from  the  vortex  shedding 
alternately  from  the  top  and  bottom  of  the  eylinder.  The  equation  for  the  Strouhal  number 
is  given  in  Equation  44, 


^  (44) 

'Jo 

where  St  is  the  non-dimensional  Strouhal  number,  /  is  the  frequeney  of  vortex  shedding, 
Lq  is  the  eharaeteristie  length  and  Uq  is  the  eharaeteristie  veloeity.  For  this  problem,  the 
eharaeteristie  length  is  the  diameter  of  the  eylinder  and  the  eharaeteristie  veloeity  is  the 
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average  flow  veloeity.  During  flow  simulation,  the  drag  and  lift  forees  were  eomputed  with 
results  presented  in  Figure  30.  The  Strouhal  number  for  this  simulation  was  determined  by 
taking  the  diserete  Fourier  transform  of  eomputed  eoeffieient  of  lift  data  and  applying  this 
along  with  Uq  and  Lq  in  Equation  44.  The  resulting  speetrum  is  presented  in  Figure  31. 
The  result  is  eompared  with  others  reported  in  the  literature  and  shows  good  agreement. 
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Figure  29:  Vortieity  plot  for  eylinder  in  2D  flow  at  Re=100. 


Author 

Strouhal  number  (St) 

Williamson  (1996) 

0.166 

Lai  (2000) 

0.165 

Silva  (2003) 

0.16 

Xu  (2006) 

0.171 

Zhou  (2012) 

0.172 

This  work 

0.169 

Table  3:  Comparison  of  Strouhal  number  to  benehmark  values. 
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Drag/Lift  Coefficient  versus  Time,  Re=100 


Figure  30:  Drag  and  lift  coefficient  for  cylinder  in  uniform  flow.  Re=100. 
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Figure  31:  Strouhal  number  computed  from  the  energy  spectra  of  the  lift  coefficient  at 
Re=100. 
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IV.  LBM  IMPLEMENTATION  ON  GRAPHICS  PROCESSING 

UNITS 


With  recent  advances  in  modem  GPUs  the  interest  in  using  these  devices  for  sci¬ 
entific  calculations  has  been  growing.  In  particular,  the  memory-bandwidth  and  compute 
capability  of  GPUs  compared  to  contemporary  CPUs  has  made  their  use  for  LBM  ap¬ 
plications  particularly  appealing.  Implementations  of  the  LBM  using  CUDA  and  the  C 
programming  language  have  been  published  and  the  viability  of  the  GPU  as  an  effective 
platform  for  executing  the  LBM  has  been  demonstrated  extensively  [45]-[50]. 

In  this  chapter,  the  computational  requirements  for  the  LBM  will  be  reviewed.  Next 
the  NVIDIA  CUDA  GPU  computing  platform  is  introduced  and  its  use  for  the  LBM  sim¬ 
ulations  presented  in  this  work  is  outlined.  Comparisons  are  made  to  recently  published 
performance  benchmarks  for  systems  with  a  single  GPU.  Lastly,  multi-GPU  implementa¬ 
tions  of  the  LBM  in  hybrid  parallel  schemes  employing  CUDA  with  OpenMP  as  well  as 
CUDA  with  MPI  will  be  presented.  Performance  of  these  codes  are  compared  with  a  more 
conventional  parallel  implementation  with  MPI. 

A.  COMPUTATIONAL  REQUIREMENTS  FOR  THE  LBM 

Though  the  operations  to  be  executed  for  each  lattice  point  during  each  time  step  are 
conceptually  straightforward  and  easy  to  implement  on  a  computer,  it  has  been  well  recog¬ 
nized  in  the  literature  that  the  LBM  is  particularly  computationally  intensive  and  memory 
demanding  [51].  Considerations  for  precision  and  stability  combine  to  dictate  a  require¬ 
ment  that  a  large  number  of  lattice  points  are  needed  to  effectively  discretize  a  problem 
domain.  Within  the  classical  LBM  formulation,  all  lattice  points  must  be  equally  spaced 
implying  that  LBM  solutions  for  problems  with  widely  varying  length  and  time  scales  of 
interest  would  perform  worse  than  an  equivalent  finite  element  method  (FEM)  or  finite 
volume  method  (FVM)  code  where  non-uniform  meshes  may  be  used.  In  addition  to  the 
large  number  of  lattice  points,  each  lattice  point  in  the  domain  requires  storage  for  each 
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value  of  /„;  9  values  for  the  popular  D2Q9  lattiee,  13-27  values  for  most  eommonly  used 
three-dimensional  lattiees.  This  is  roughly  double  the  requirement  for  a  more  traditional 
solver  for  the  ineompressible  Navier-Stokes  equation  [16]. 

In  addition  to  ample  memory  and  eomputational  eapability,  the  LBM  requires  very 
great  memory  bandwidth  so  that  data  ean  be  streamed  into  the  eomputing  eores.  For  eaeh 
time  step,  eaeh  value  of  fa  must  be  loaded  from  memory  and  stored  again  at  least  once.  The 
number  of  floating  point  operations  needed  depends  upon  the  choice  of  boundary  condi¬ 
tions  and  collision  operator,  but  as  a  rule  of  thumb,  roughly  20  floating  point  operations  are 
required  for  every  value  of  fa-  This  implies  that  in  order  to  achieve  a  computational  perfor¬ 
mance  of  a  high-end  CPU,  say  500  billion  floating  point  operations  per  second  (GFLOPS), 
with  single-precision  arithmetic,  the  computing  device  would  require  a  memory  bandwidth 
of  at  least  200  gigabytes  (GB)  per  second. 

The  memory-bandwidth  and  theoretical  computational  capabilities  of  selected  CPUs 
and  GPUs  is  depicted  in  Figure  32.  The  relationship  between  achievable  computational  per- 

Theoretical  GB/s  Theoretical 


Figure  32:  Historical  trends  for  CPU  and  GPU  memory  bandwidth  and  compute  perfor¬ 
mance  (From  [52]). 

formance  versus  memory  bandwidth  available  for  a  typical  LBM  problem  is  illustrated  in 
Figure  33  along  with  the  computing  and  memory  bandwidth  capability  of  representative 
hardware.  As  can  be  seen,  for  current  CPU  and  GPU  systems,  LBM  implementations  tend 
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to  be  memory  bandwidth  bound  rather  than  computationally  bound.  This  observation  has 
also  been  reported  in  the  literature  [53]. 


IBM:  Memory-bandwith  vs  Computational 
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Figure  33:  Memory  bandwidth  requirement  versus  desired  computational  throughput  for 
a  typical  LBM  implementation.  Modern  CPU  and  GPU  hardware  are  memory  bandwidth 
limited  for  LBM. 


B.  AN  OVERVIEW  OF  GPUS  AND  NVIDIA  CUDA 

The  purpose  of  this  section  is  to  familiarize  the  reader  with  the  basic  concepts  of 
GPU  architecture  and  CUDA  programming.  The  treatment  is  not  comprehensive  but  is 
intended  to  be  sufficient  so  that  implementation  details  and  design  decisions  made  in  the 
LBM  implementation  for  this  work  can  be  understood  within  the  context  of  programming 
on  NVIDIA  GPUs  with  CUDA.  For  a  more  detailed  treatment,  the  reader  is  directed  to 
[54]-[56]. 

I.  NVIDIA  GPU  Architecture 

A  simplified  schematic  of  the  architecture  of  an  NVIDIA  GPU  is  shown  in  Figure 
34.  The  GPU  device  is  attached  to  the  computer  and  communicates  with  external  compo¬ 
nents  via  the  PCIe  bus.  Within  each  device  is  a  number  of  streaming  multiprocessors  SMs 
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and  within  each  SM  is  a  number  of  scalar  processors  (SPs).  The  number  of  SMs  per  device 
and  the  number  of  SPs  per  SM  varies  between  generations  of  GPUs  and  among  particular 
models  in  a  given  generation.  Each  SM  contains  a  register  file  used  by  the  SPs  as  well  as 
a  memory  space  that  serves  partly  as  a  level  1  cache  and  a  user-managed  shared  memory. 
The  device  also  contains  a  bank  of  level  2  cache  as  well  as  a  bank  of  device  memory  that 
is  characterized  by  greater  capacity  but  slower  access. 


Figure  34:  Simplified  schematic  of  NVIDIA  GPU. 

The  NVIDIA  GPU  is  designed  for  highly  multithreaded  high-throughput  comput¬ 
ing.  Each  scalar  processor  is  assigned  a  thread  for  execution.  Threads  are  organized  in  a 
hierarchical  way  into  blocks  and  grids  as  illustrated  by  Figure  35.  Multithreaded  programs 
executing  on  the  GPU  are  “launched”  as  a  grid  of  thread  blocks.  Grids  of  threads  are  com¬ 
posed  of  at  three  dimensional  array  of  blocks.  The  blocks,  in  turn,  are  composed  of  a  three 
dimensional  array  of  threads.  The  blocks  are  distributed  among  the  SMs  for  execution. 
Each  of  the  SMs  execute  the  threads  contained  within  each  thread  block  independently 
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and  in  parallel  in  groups  called  thread  warps  comprised  of  32  threads  each.  The  individual 
warps  are  sent  as  a  group  among  the  SPs  contained  within  an  SM  for  execution.  All  threads 
within  a  warp  must  execute  the  same  instruction,  so  the  smallest  discrete  unit  of  parallelism 
is  the  thread  warp.  Once  all  of  the  thread  warps  in  a  block  are  complete,  the  SM  is  available 
to  be  assigned  more  blocks.  This  process  repeats  until  all  of  the  blocks  contained  in  a  grid 
have  executed. 
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1 

Thread  (3,  2) 

Figure  35:  Hierarchy  of  threads  in  a  CUDA  program.  Threads  are  organized  into  blocks; 
blocks  are  organized  into  a  grid  (From  [52]). 


2.  CUDA  C  Programming  Model 

The  CUDA  programming  model  is  designed  to  provide  the  programmer  a  way  to 
express  their  algorithms  in  a  program  that  can  execute  on  the  multithreaded  hardware  de¬ 
scribed  in  the  previous  section.  The  complete  program  is  decomposed  into  those  functions 
that  are  to  be  executed  on  the  GPU  which  are  called  kernels,  and  the  remainder  of  the  pro¬ 
gram  which  is  simply  written  in  the  chosen  programming  language.  The  CUDA  program- 
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ming  model  augments  the  C  programming  language  with  additional  syntax  and  keywords 
along  with  an  extensive  applieation  programming  interfaee  (API)  enabling  the  programmer 
to  to  effeet  eommon  programming  tasks  targeted  for  the  GPU  sueh  as  deviee  seleetion, 
memory  alloeation,  memory  transfer  and  memory  de-alloeation.  Funetions  that  are  written 
for  exeeution  on  the  GPU  are  denoted  with  the  ..global  qualifier  as  illustrated  in  the  eode 
listing  below. 

_ global  void  myKernelf float  *  argl,  const  int  arg2){ 

II...  kernel  body. . . 

} 

Within  the  C  souree  eode,  the  kernels  are  invoked  with  a  speeial  syntax  in  whieh  the  pro¬ 
grammer  speeifies  the  dimensions  of  the  grid  (i.e.,  eonfiguration  of  bloek  array  within  the 
grid  and  eonfiguration  of  the  thread  array  within  eaeh  bloek).  The  syntax  is  illustrated  in 
the  eode  listing  below: 

dim3  BlockSize(ThreadX,ThreadY,ThreadZ) ; 
dim3  GridSize(BlockX.BlockY,BlockZ) ; 
myKernel«<GridSize, BlockSize»>(argl, arg2) ; 

where  dim3  is  an  integer  veetor  defined  as  part  of  the  CUDA  API  that  is  used  to  express 
the  grid  and  bloek  dimensions. 

From  the  programmer’s  perspeetive,  sealing  of  thread  exeeution  among  SMs  and 
SPs  is  transparent.  It  is  not  even  neeessary  for  the  programmer  to  be  aware  of  how  many 
SMs  or  SPs  are  present  in  a  system.  Code  written  for  one  GPU  with  a  given  number  of 
SMs  and  SPs  will  run  without  modifieation  or  even  reeompilation  on  a  GPU  with  twiee  as 
many  of  eaeh;  it  will  just  run  faster. 

Some  restrietions  exist  in  the  CUDA  programming  model.  The  programmer  has 
no  eontrol  over  the  order  in  whieh  the  thread  bloeks  will  exeeute.  Additionally,  no  global 
thread  synehronization  within  the  eontext  of  a  single  kernel  is  possible.  Onee  started  eaeh 
thread  bloek  runs  to  eompletion;  the  programmer  ean  only  enforee  barrier  synehronization 
for  threads  within  an  individual  thread  bloek. 
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Like  other  shared-memory  programming  models  sueh  as  OpenMP,  CUDA  does 
faeilitate  shared  variables  between  individual  threads.  For  CUDA,  this  is  only  possible  for 
threads  within  a  single  thread  bloek.  This  is  done  using  the  shared  memory  bloek  within 
eaeh  SM.  Any  data  that  must  be  exehanged  with  threads  outside  of  the  thread  bloek  must  be 
done  via  global  memory  and,  sinee  exeeution  of  distinet  thread  bloeks  eannot  be  seheduled 
by  the  programmer,  it  must  be  done  asynehronously. 

A  programmer  planning  to  use  CUDA  to  obtain  high  performanee  must  design  their 
programs  so  as  to  make  maximum  use  of  eomputing  resourees  while  not  overloading  the 
available  memory  bandwidth.  Maximizing  use  of  the  eomputing  resourees  means  mapping 
the  most  fine-grained  level  of  parallelism  within  the  algorithm  to  the  grid  strueture  provided 
in  the  CUDA  programming  model  to  generate  as  many  threads  as  possible.  Using  LBM  as 
an  example  the  ehoiee  almost  universally  made  is  to  assign  all  of  the  eomputations  required 
for  a  single  lattiee  point  to  a  unique  CUDA  thread.  All  of  the  threads  for  all  of  the  lattiee 
points  would  then  be  mapped  into  bloeks  and  ultimately  a  thread  grid  for  exeeution.  For 
simulations  with  many  lattiee  points,  this  will  generate  suffieient  parallelism  to  keep  many 
of  the  SMs  and  their  assoeiated  SPs  busy  doing  produetive  work. 

Conservation  of  global  memory  bandwidth  is  done  in  two  ways:  first,  by  minimiz¬ 
ing  the  number  of  global  memory  transaetions;  seeond,  by  ensuring  that  the  global  mem¬ 
ory  transaetions  are  as  effieient  as  possible  by  ensuring  load  and  store  operations  to  global 
memory  are  eoaleseed.  For  GPU  programming  of  the  LBM,  there  is  mueh  less  agreement 
in  how  best  to  aehieve  the  goal  of  making  best  use  of  available  memory  resourees.  This  is 
not  surprising  sinee  effieient  use  of  memory  bandwidth  is  the  most  important  determiner  of 
program  performanee.  Some  of  the  design  alternatives  will  be  diseussed  in  the  seetion  on 
LBM  implementation  with  CUDA. 

A  great  deal  of  effort  is  made  among  researehers  to  ensure  that  their  programs 
make  the  most  effeetive  use  possible  of  their  GPUs.  The  resulting  eodes  and  algorithms  are 
somewhat  more  eomplex  than  what  will  be  deseribed  in  this  work.  Instead  of  employing 
every  last  areane  triek  to  squeeze  the  last  epsilon  of  effieieney  from  the  GPU,  this  work 
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will  demonstrate  the  effectiveness  of  straight-forward  though  efficiently  written  code  that 
seeks  first  to  embody  the  most  important  GPU  programming  best  practices.  A  collection  of 
these  high-level  CUDA  programming  best  practices  is  provided  by  NVIDIA  in  [57].  Some 
of  these  guidelines  are  repeated  here: 

1.  Minimize  data  transfer  between  the  host  and  the  device.  Data  transfer 
should  be  avoided  even  if  it  means  running  some  kernels  on  the  device 
even  when  they  do  not  show  significant  performance  gains  compared  to 
running  those  functions  on  the  CPU. 

2.  Ensure  global  memory  accesses  are  coalesced  whenever  possible.  Se¬ 
quences  of  threads  in  a  warp  should,  whenever  possible,  access  sequential 
locations  of  memory.  When  this  is  done  the  reading/writing  operations  can 
be  done  in  a  single  memory  access. 

3.  Minimize  the  use  of  global  memory.  Prefer  shared  memory  access  where 
possible. 

4.  Avoid  different  execution  paths  within  the  same  warp.  Use  of  if-then- 
else  control  structures  result  in  the  requirement  for  warps  to  traverse  code 
segments  multiple  times  as  threads  within  the  warp  take  different  control 
paths.  This  is  termed  thread  divergence. 

C.  LBM  IMPLEMENTATION  WITH  CUDA 

In  this  section,  the  implementation  strategy  employed  for  this  work  will  be  outlined. 
A  great  deal  of  software  was  written  for  this  work,  mainly  comprising  the  multitude  of 
experiments  in  data  layouts,  grid  setups  and  register  usage  strategies  aimed  at  obtaining 
high  performance  while  maintaining  some  modicum  of  code  flexibility.  In  the  following 
sections,  both  the  basic  implementation  as  well  as  steps  taken  towards  optimization  will  be 
reviewed  in  turn. 
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1.  Basic  Implementation 

The  basic  implementation  of  the  LBM  on  the  GPU  is  discussed  in  this  section.  The 
discussion  is  broken  up  into  two  parts.  First,  the  essential  calculations  required  for  the 
LBM  routine  are  considered.  Second,  the  question  of  how  to  best  arrange  the  main  LBM 
variables — fa  for  all  of  the  lattice  points — in  memory. 

a.  LBM  Routine 

Every  LBM  routine  must  provide  for  certain  identifiable  milestones.  These 
are  briefly  listed  below: 

•  Problem  initialization. 

•  Computation  of  macroscopic  flow  properties  such  as  p  and  u. 

•  Enforcement  of  boundary  conditions  to  force  the  proper  flow  and 
solve  the  correct  problem. 

•  Collision  to  relax  towards  equilibrium. 

•  Streaming  to  propagate  information  across  the  EBM  grid. 

•  Exporting  of  data  to  allow  post-processing. 

Several  methods  for  initializing  the  values  of  fa  at  each  lattice  point  have 
been  analyzed  in  the  literature  [58].  Eor  this  work,  all  lattice  points  are  initialized  by  setting 
u  =  0  and  p  equal  to  the  nominal  density  of  the  fluid  to  be  used  in  the  simulation.  Then 
f^  is  computed  using  Equation  10.  These  tasks  can  either  be  done  with  the  CPU  prior  to 
copying  the  lattice  data  to  the  GPU  or  it  can  be  implemented  in  a  separate  kernel  prior  to 
commencing  time  stepping. 

Computing  of  macroscopic  properties  and  enforcement  of  boundary  condi¬ 
tions  are  frequently  done  in  conjunction  with  the  collision  step.  This  is  done  because  the 
macroscopic  properties  are  often  only  required,  within  the  context  of  the  EBM  simulation, 
for  calculation  of  f^  which  is  required  for  collision  and,  for  some  schemes,  boundary 
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condition  enforcement.  To  eompute  maeroseopie  properties  separate  from  either  boundary 
eondition  enforeement  or  eollision  would  require  storing  the  values  in  global  GPU  memory. 
For  this  reason,  in  light  of  the  CUDA  programming  guidanee  to  minimize  global  memory 
transaetions,  the  steps  of  eomputing  maeroseopie  flow  properties,  boundary  eondition  en- 
foreement  and  eollision  are  always  done  in  the  same  kernel. 

The  streaming  step  for  the  elassieal  LBM  is  simply  a  data  eopy  operation. 
While  this  is  simple  to  implement,  it  is  the  subjeet  of  mueh  researeh  as  to  how  to  best 
exeeute  the  streaming  step  in  a  way  that  memory  aeeesses  are  eoaleseed. 

Lastly,  any  simulation  is  pointless  if  there  is  no  way  to  evaluate  the  results. 
For  this  work,  intermediate  values  for  the  fluid  veloeity  and  pressure  field  were  periodieally 
transferred  from  the  GPU  to  CPU  and  written  to  disk  using  Visualization  Toolkit  (VTK)  file 
formats.  For  FSI  eomputations,  displaeement,  veloeity  and  aeeeleration  data  was  similarly 
stored  for  later  post-proeessing. 

b.  Data  Layout 

The  two  prineipal  alternative  data  layouts  for  LBM  eomputations  are  the 
so-ealled  AoS  or  SoA.  The  two  alternatives  are  illustrated  sehematieally  in  Figure  36.  In 
AoS,  the  density  distribution  values  fa  for  a  given  lattiee  point  are  assigned  in  eonseeutive 
memory  loeations.  In  SoA,  the  density  distribution  for  all  of  the  lattiee  points  for  a  given 
lattiee  speed  are  assigned  eonseeutive  memory  loeations;  these  are  followed  by  the  density 
distribution  funetion  for  the  next  veloeity  and  so-on  until  all  of  the  data  has  a  loeation. 

For  LBM  ealeulations  eondueted  on  the  CPU,  it  is  most  appealing  use  the 
AoS  sinee  this  will  allow  the  CPU  to  aeeess  sequential  memory  loeations  while  aeeessing 
the  data  for  a  partieular  lattiee  point.  This  will  allow  for  effieient  memory  transfers  as  well 
as  effeetive  use  of  the  memory  eaehe  hierarehy.  In  eontrast,  most  LBM  implementations 
on  the  GPU  use  the  SoA  approaeh.  With  the  SoA,  when  data  is  loaded  from  memory 
within  a  kernel,  eaeh  thread  in  a  given  warp  reads  from  eonseeutive  memory  loeations  as 
illustrated  in  Figure  37.  When  loads  are  eoaleseed  in  this  fashion,  the  data  is  transferred 
from  memory  in  a  single  transaetion.  A  similar  eondition  exists  during  store  operations  as 
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Consecutive  memory  locations 


■> 


(a) 


(b) 


Figure  36:  Schematic  of  data  layout  schemes,  (a)  depicts  the  AoS,  (b)  depicts  the  SoA. 
Superscripts  indicate  lattice  node  number,  subscripts  indicate  the  lattice  velocity. 


well.  This  is  in  conformance  with  the  guidance  to  ensure  coalesced  memory  access.  As  a 
rule  of  thumb,  using  the  AoS  approach  on  the  GPU  penalizes  achievable  performance  by  a 
factor  of  approximately  two. 


float  f0  =  fEven(0*nnodes+tid] ; 


float  fl  =  fEven[l*nnodes+tid] ; 
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Figure  37:  When  using  SoA,  load  instructions  executed  by  consecutive  threads  read  from 
consecutive  locations  in  memory 
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2.  Optimization 

Once  a  basic  implementation  has  been  prepared,  tested  and  validated,  it  is  time  to 
look  for  ways  to  improve  performance.  For  this  work,  the  optimization  strategies  taken 
resulted  in  a  four-fold  increase  in  performance  over  the  baseline  GPU-accelerated  imple¬ 
mentation.  The  principal  optimization  strategies  fall  in  the  following  categories: 

•  Kernel  structure 

•  Register  versus  shared  memory  trade-offs;  and 

•  Thread  block  dimensions. 

The  details  are  discussed  in  the  subsections  below. 

a.  Kernel  Structure 

For  this  work,  all  computations  required  to  execute  a  single  time  step  on 
each  lattice  point  are  collected  into  a  single  kernel.  This  implies  a  number  of  compromises. 
First,  since  the  order  of  execution  of  blocks  of  threads  is  outside  of  programmer  control,  a 
second  lattice  is  used  so  that  one  lattice  is  active  with  each  time  step.  The  LBM  collision 
and  boundary  conditions  are  enforced  on  the  active  lattice  and  the  result  is  streamed  to  the 
alternate  lattice.  This  results  in  a  reduction  of  memory  bandwidth  demand  by  a  factor  of  two 
from  the  naive  implementation  while  paying  the  cost  of  doubling  the  memory  consumption. 

Second,  this  unified  times  step  kernel  structure  also  imposes  a  penalty  on 
the  modularity  of  the  LBM  code.  Any  alternative  selection  of  boundary  conditions  or 
relaxation  schemes  necessitates  construction  of  a  new  kernel.  This  penalty  is  emphasized 
by  some  authors  in  the  literature  [50].  It  was  found  during  this  work  that  the  structure  of  the 
kernel  itself  is  modular,  and  amenable  to  systematic  construction.  Individual  components 
of  the  kernel  time-step  could  be  “cut-and-paste”  into  a  basic  kernel  to  provide  the  desired 
customization.  This  is  a  dubious  software  engineering  practice  when  done  manually;  if 
automated  through  meta-programming,  however,  it  can  become  a  powerful  tool. 


62 


streaming 


Figure  38:  Schematic  of  the  dual  lattice  scheme  used  to  support  a  unified  time  step  kernel. 
On  even  time  steps,  the  Even  Lattice  is  active  and  it  collides  and  streams  to  the  Odd  Lattice; 
vice  versa  for  odd  time  steps. 

b.  Registers  versus  Shared  Memory 

The  kernels  used  in  this  research  made  heavy  use  of  registers.  Every  density 
distribution  function  for  a  given  lattice  point  was  assigned  its  own  register  value.  Additional 
sets  of  registers  were  used  for  and  any  other  temporary  value  needed.  The  result  was 
that  the  global  arrays  holding  the  values  for  fa  were  only  accessed  at  the  beginning  of  the 
time  step  as  the  data  is  loaded  into  the  SM  and  at  the  end  of  the  time  step  for  streaming  to  the 
alternate  array.  This  practice  resulted  in  some  awkward  program  structures;  since  register 
variables  cannot  be  indexed,  all  loops  were  completely  unrolled.  This  negatively  impacts 
program  maintainability,  however  as  mentioned  previously,  these  steps  are  amenable  to 
automation.  This  register  use  is  a  key  element  to  the  strategy  to  minimizing  global  memory 
accesses. 

An  alternate  strategy  would  be  to  store  the  density  distribution  values  in 
shared  memory  instead  of  registers.  This  practice  is  avoided  in  this  work  for  several  rea¬ 
sons.  First,  for  the  NVIDIA  GPU  architecture  has  a  relatively  large  32K  register  file.  Sec- 
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Memory  Type 

Bandwidth  (^) 

Register  Memory 

^  8,000 

Shared  Memory 

^  1,600 

Global  Memory 

192 

Mapped  Memory 

8  (one-way) 

Table  4:  Memory  bandwidth  of  various  CUDA  memory  spaces  on  an  NVIDIA  GTX-580 
GPU 

ond,  use  of  shared  memory  incurs  and  overhead  of  integer  arithmetic  for  shared  memory 
array  indexing.  Third,  though  the  memory  bandwidth  between  shared  memory  and  the  SPs 
is  an  order  of  magnitude  faster  than  global  memory,  it  is  much  slower  than  the  bandwidth 
between  the  register  files  and  the  SPs  [56].  The  bandwidth  of  various  CUDA  memories 
is  summarized  in  Table  4.  Despite  the  speed  of  shared  memory,  when  combined  with  the 
integer  arithmetic  overhead  and  limited  bandwidth,  it  has  been  shown  that  contemporary 
GPUs  are  unable  to  achieve  peak  performance  when  using  shared  memory  [56]. 

The  last  reason  is  that  shared  memory  is  not  used  simply  because  individ¬ 
ual  values  of  fa  are  often  not  shared  between  threads.  Shared  memory  should  be  utilized 
for  variables  that  are  shared  such  as  p(x,  t)  in  multi-component  models,  or  the  relaxation 
matrix  in  MRT  collision  models.  In  these  cases,  shared  memory  is  the  only  option  for  effi¬ 
ciently  exchanging  information  between  threads  and  this  resource  should  not  be  squandered 
when  more  efficient  mechanisms  for  storing  intermediate  non- shared  data  like  registers  are 
available. 


c.  Thread  Block  Dimensions 

Programs  written  for  execution  on  the  GPU  are  typically  very  sensitive  to 
the  specific  layout  of  the  thread  grid.  The  three-dimensional  array  of  thread  blocks  can 
have  as  many  as  65,535  blocks  in  any  dimension.  The  three-dimensional  thread  blocks 
are  somewhat  more  limited;  for  Fermi-class  NVIDIA  GPUs,  the  maximum  thread  block 
dimension  is  1024  and  the  product  of  all  thread  dimensions  must  be  less  than  1532  [52]. 
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Performance  vs  Threads  per  Block 


Figure  39:  LBM  performance  on  GTX-580  for  three-dimension  lid-driven  cavity  as  a  func¬ 
tion  of  threads  per  block. 

These  limits  are  important  and  must  be  respected,  but  they  do  not  help  select  the  best  thread 
configuration  within  those  limits. 

As  an  experiment,  the  three-dimensional  lid-driven  cavity  problem  is  run 
with  a  series  of  one-dimensional  thread  blocks  arranged  in  a  one-dimensional  grid.  A 
D3Q15  lattice  structure  was  selected  with  LBGK  collision  operator  and  a  64  x  64  x  64 
lattice  was  used  for  a  total  of  262,144  lattice  points.  The  simulation  was  run  on  a  GTX-580 
GPU.  The  threads  per  block  was  varied  and  the  performance  fore  each  thread  block  size 
is  shown  in  Figure  39.  Notice  that  most  of  the  peaks  in  the  performance  plot  occur  when 
the  number  of  threads  per  block  is  a  multiple  of  32.  This  number  is  important  because  it 
implies  that  all  warps  within  the  block  will  be  fully  associated  with  useful  work  and  makes 
possible  fully  coalesced  global  memory  reads. 

The  general  trend  is  lower  performance  for  very  small  thread  blocks  and 
also  lower  performance  for  very  large  thread  blocks.  For  the  small  thread  block  sizes 
performance  suffers  because,  with  a  limitation  of  only  8  blocks  assigned  to  each  SM,  an 
insufficient  number  of  total  threads  is  kept  in  flight  to  hide  data  access  latencies  and  exploit 
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the  full  parallelism  of  the  GPU.  With  very  large  thread  blocks  the  reduced  performance  is 
a  combination  of  a  reduced  number  of  blocks  per  SM  due  to  register  resource  limitations 
resulting  in  an  overall  reduction  in  the  number  of  threads  in  flight  and  accompanying  loss 
of  parallel  performance.  With  an  intermediate  number  of  threads  which  is  also  a  multiple 
of  32,  more  blocks  are  assigned  to  each  SM,  more  total  threads  are  kept  in  flight,  SMs  have 
adequate  resources  to  service  the  threads  that  are  assigned  to  them.^ 

Though  this  was  just  one  example  case,  the  general  trends  are  the  same  for 
other  LBM  solvers  developed  for  this  work.  As  a  summary  for  thread  block  selection,  the 
following  summary  is  offered  for  use  as  a  guideline. 

•  Choose  thread  block  size  that  is  a  multiple  of  32.  This  will  en¬ 
sure  all  thread  warps  are  associated  with  useful  work  and  allow  for 
efficient  memory  access. 

•  Choose  a  thread  block  size  that  is  large  enough  to  ensure  enough 
threads  are  in  flight  to  hide  memory  access  latencies. 

•  Choose  a  thread  block  size  is  not  too  large  so  that  an  individual  SM 
has  adequate  resources  to  execute  at  least  one  block  at  a  time. 

•  For  most  problems,  a  good  starting  point  is  128  threads  per  block. 

These  guidelines  may  be  effective  as  a  starting  point  for  testing,  but  not  as  a  rule  that  may 
be  used  blindly.  Ultimately,  there  does  not  appear  to  be  an  alternative  to  experimentation 
to  find  the  thread  block  size  that  is  best  for  a  given  implementation  of  a  given  problem. 

D.  PERFORMANCE  BENCHMARK-3D  LID-DRIVEN  CAVITY 

In  order  to  compare  the  effectiveness  of  the  implementation  strategies  adopted  for 
this  work,  a  three-dimensional  lid-driven  cavity  problem  was  selected  as  a  benchmark. 

^The  main  concern  with  too  many  threads  is  register  spillage.  If  there  are  more  register  variables  declared 
in  a  kernel  than  can  be  accommodated  by  the  register  file,  these  excess  variables  are  “spilled”  to  local  memory. 
Local  memory  is  private  memory  to  each  thread  block  but  physically  it  is  located  on  device  global  memory. 
Despite  the  use  of  level  1  and  level  2  caching  on  new  generation  GPUs,  performance  is  degraded  when  this 
happens. 
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Device 

GTX-260 

GTX-295 

GTX-480 

GTX-580 

Number  of  CUDA  cores 

192 

240  X  2 

480 

512 

Global  Memory  (MB) 

896 

896  X  2 

1536 

1536 

Memory  Bandwidth  (GB/s) 

111.9 

111.9  X  2 

177.4 

192.2 

Estimated  Peak  Performance  (Gflops) 

805 

805  X  2 

1345 

1581 

Table  5:  Properties  of  GPU  devices  used  in  benchmark  computations  in  Figure  40 


Three  comparable  works  recently  published  in  the  literature  will  be  used  as  comparisons 
[49],  [50]  and  [45].  In  order  to  make  a  more  fair  comparison  between  all  of  the  results,  the 
reported  performance  figures  will  be  normalized  for  memory  bandwidth  capability  for  the 
GPU  device  on  which  each  comparable  result  was  computed.  The  relevant  characteristics 
of  these  devices  are  listed  in  Table  5.  The  normalized  performance  is  shown  in  Figure  40. 


Lid  Driven  Cavity  D3Q19 

(Scaled  for  Memory  Bandwidth) 

#  This  Work 

■  Rinaldi et  al.  (2012) 

A  Obrecht eta  1.(2010) 
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Figure  40:  Performance  benchmark  for  LBM  on  a  3D  lid-driven  cavity  scaled  for  device 
memory  bandwidth. 
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It  can  be  seen  from  Figure  40  that  the  result  of  this  work  is  comparable  to  or  bet¬ 
ter  than  other  recently  reported  implementations.  It  is  somewhat  surprising  that  the  results 
reported  by  Astorino  et  al.  in  [50]  are  so  poor.  In  their  paper,  emphasis  is  made  on  the  mod¬ 
ularity  and  generality  of  the  C-t-i-  implementation.  While  most  implementations,  including 
this  work,  combine  all  elements  of  a  LBM  time  step  within  a  single  kernel,  the  work  de¬ 
scribed  in  [50]  is  modular  in  the  sense  that  computations  concerning  boundary  conditions, 
collision  and  streaming  are  separated  into  separate  kernels  that  can  be  individually  main¬ 
tained.  While  it  is  conceded  in  this  work  that  software  engineering  considerations  such  as 
maintainability  and  ease  of  future  expansion  are  laudable,  those  considerations  often  take  a 
back  seat  to  performance  due  to  the  higher  demands  on  memory  bandwidth  due  to  the  need 
to  store  intermediate  variables. 

In  order  to  achieve  the  best  performance  for  each  problem  size,  the  number  of 
threads  per  block  must  be  adjusted  accordingly.  The  dependence  of  execution  performance 
on  the  thread  block  size  is  illustrated  in  Figure  41.  Using  96  threads  per  block  performs 
well  for  all  problem  sizes,  while  the  use  of  256  threads  in  a  block  performs  very  poorly  for 
all  problem  sizes. 

E.  HYBRID  PARALLEL  LBM 

Though  excellent  execution  performance  on  LBM  problems  can  be  achieved  when 
the  program  is  executed  on  a  GPU,  it  is  inevitable  that  the  need  will  arise  to  perform  sim¬ 
ulations  for  which  the  basic  data  variables  are  too  numerous  to  store  in  the  global  memory 
of  a  single  GPU.  The  programs  ultimately  have  to  scale  to  multiple  devices.  For  this  work, 
the  GPU-based  implementation  with  CUDA  on  an  NVIDIA  GPU  is  coupled  with  tradi¬ 
tional  parallel  programming  methodologies.  First,  the  GPU- accelerated  LBM  code  will  be 
augmented  with  OpenMP  directives  to  allow  execution  on  multiple  GPUs  installed  on  a 
single  workstation.  Second,  the  GPU-accelerated  code  will  instead  be  employed  within  a 
distributed  programming  environment  using  MPI. 
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Figure  41:  LBM  on  a  3D  lid-driven  cavity  with  various  number  of  threads  per  block. 


69 


1.  CUBA  with  OpenMP 

When  multiple  GPUs  are  used,  it  may  be  taken  to  imply  that  a  distributed  memory 
programming  model  must  be  employed.  Since  none  of  the  devices  are  capable  of  han¬ 
dling  the  entire  domain,  it  is  logical  that  the  domain  must  be  partitioned  with  relevant  data 
distributed  among  the  devices.^  However,  for  workstations  equipped  with  multiple  GPUs, 
the  use  of  CUDA  alongside  of  OpenMP  is  an  attractive  option.  Since  all  of  the  devices 
are  physically  located  on  the  same  machine,  it  is  convenient  to  use  the  shared-memory 
paradigms  of  OpenMP  rather  than  explicitly  dealing  with  the  message  passing  API  of  MPI. 

For  this  work,  a  workstation  equipped  with  six  NVIDIA  C2070  GPUs  is  employed 
to  execute  the  three-dimensional  lid-driven  cavity  problem  with  both  CUDA  and  OpenMP. 
The  problem  domain  is  partitioned  geometrically  among  the  available  devices  and  inter¬ 
partition  boundary  data  is  exchanged  between  the  devices  in  a  peer-to-peer  mode.  Though 
the  bandwidth  capability  of  the  PCIe  bus  is  still  a  limitation,  the  demand  is  comparatively 
low  since  only  boundary  data  is  exchanged. 

Performance  results  for  the  three-dimensional  lid  driven  cavity  problem  is  shown  in 
Figure  42.  For  the  simulation  a  D3Q15  lattice  was  used  with  LBGK  collision  operator.  The 
lattice  size  was  set  to  500x500x500  lattice  for  a  total  of  125  million  lattice  points.  It  can  be 
seen  that  the  scaling  is  somewhat  less  than  linear.  This  is  attributed  primarily  to  the  fact  that 
this  particular  multi-GPU  implementation  that  does  not  overlap  computations  with  peer-to- 
peer  communications.  Future  CUDA/OpenMP  will  incorporate  this  optimization  and  is 
expected  to  improve  scalability. 

2.  CUDA  with  MPI 

Though  using  CUDA  in  conjunction  with  OpenMP  makes  it  possible  to  employ 
multiple  GPUs  to  bring  larger  problem  sizes  into  reach,  this  solution  still  has  limitations  in 

^Strictly  speaking,  this  is  not  true.  With  recent  GPUs  operating  computers  with  sufficient  physical  mem¬ 
ory  and  a  64-bit  operating  system,  it  is  possible  to  maintain  the  entire  problem  in  the  (usually  larger)  CPU 
system  memory  with  each  sub  domain  logically  mapped  to  the  GPU  that  will  be  assigned  to  carry  out  its 
computations.  Unfortunately,  the  memory  transfer  overhead  across  the  comparatively  slow  8  GB/sec  PCIe 
bus  from  the  CPU  memory  to  the  GPU  renders  this  convenient  programming  technique  impractical. 
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Figure  42:  Lid-driven  cavity  using  a  D3Q15  lattice  with  500^  points  using  CUDA  and 
OpenMP. 


scalability.  While  it  is  conceivable  that  workstations  will  be  constructed  capable  of  hosting 
more  than  six  GPUs  and  that  each  GPU  will  gradually  increase  its  memory  size  thus  be 
capable  of  handling  larger  problems,  the  need  currently  exists  to  model  fluid  problems 
using  LBM  that  require  billions  of  lattice  points  and  must  be  simulated  for  millions  of  time 
steps.  In  order  to  scale  to  these  problem  sizes,  it  is  necessary  to  develop  LBM  codes  that 
can  be  deployed  cooperatively  on  an  arbitrarily  large  number  of  nodes. 

The  standard  programming  paradigm  for  programs  of  this  type  is  to  use  a  distributed 
parallel  programming  model  based  on  MPI.  When  developing  a  program  that  uses  MPI  it 
is  critically  important  that  the  algorithm  is  implemented  such  that  necessary  computations 
are  interleaved  with  any  communication  requirements  that  may  exist  for  the  problem.  In 
addition  to  high-performance  CPUs,  the  other  key  feature  that  the  hardware  on  which  the 
code  is  to  run  must  posses  is  high-speed  interconnects.  For  this  work,  a  test  code  that 
used  only  MPI  with  CPUs  was  developed  to  analyze  three-dimensional  Poiseuille  flow. 
The  lattice  points  throughout  the  domain  were  partitioned  geometrically  and  distributed  as 
evenly  as  possible  to  all  available  processors. 
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Figure  43:  Schematic  LBM  time  step  for  distributed  computing  with  MPI.  Scalability  is 
achieved  by  interleaving  communication  with  computation. 


To  allow  for  scalability,  the  lattice  points  assigned  to  each  processor  were  further 
partitioned  into  boundary  points  and  interior  points.  The  LBM  time  step  computations  were 
performed  for  the  boundary  points  in  each  sub  domain  first.  Once  these  computations  were 
complete,  the  values  of  fa  that  are  streamed  out  of  each  sub  domain,  and  correspondingly 
streamed  in  to  neighboring  sub  domains,  are  exchanged  using  non-blocking  MPI  commu¬ 
nication  protocols.  Concurrent  with  this  communication  process,  all  processors  executed 
the  computations  for  the  LBM  time  step  on  their  interior  lattice  points.  The  overall  process 
is  illustrated  in  Figure  43.  The  performance  results  for  weak  scaling  are  presented  in  Figure 
44.  For  each  test  case,  the  lattice  size  was  adjusted  to  maintain  a  constant  40x40x16  lattice 
size  for  each  MPI  process. 

A  notable  result  that  can  be  drawn  from  Figure  44  is  that  a  large  number  of  CPU 
cores  are  required  to  achieve  performance  comparable  with  a  single  high-performance 
GPU.  Recalling  that  the  LBM  is  a  memory-bandwidth  constrained  problem,  the  expla¬ 
nation  for  the  large  number  of  CPU  cores  required  is  the  comparatively  limited  memory 
bandwidth  that  each  CPU  core  has  available  to  it.  The  hardware  on  which  this  test  was 
run  was  a  cluster  containing  32  dual  quad-core  Xeon  processors.  Though  they  were  inter¬ 
connected  with  high  speed  interconnects,  each  pair  of  CPUs  shared  an  aggregate  memory 
bandwidth  of  only  approximately  8  Comparing  this  with  the  192  ™  memory  band- 
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LBM  Parallel  Performance- D3Q15  Lattice 
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Figure  44:  Weak  scaling  using  MPI  for  LBM  simulation  of  three-dimensional  Poiseuille 
flow. 

width  of  an  NVIDIA  GTX-580  GPU,  having  a  memory  bandwidth  advantage  of  24  to  I.  In 
this  light,  it  is  not  surprising  that  nearly  200  CPU  cores  were  required.  It  is  conceded  that 
higher  performance  computer  systems  would  have  yielded  better  CPU  results. 

The  key  insight  that  this  work  seeks  to  demonstrate  is  that  the  GPU  brings  concen¬ 
trated  scalable  performance.  A  collection  of  GPUs  which  individually  have  high  memory 
bandwidth,  coupled  with  high-speed  interconnects,  can  effectively  combine  their  aggre¬ 
gate  memory  bandwidth  and  compute  capability  using  fewer  devices  than  is  necessary  with 
CPUs.  This  will  only  remain  true  so  long  as  GPUs  maintain  their  overwhelming  superior¬ 
ity  in  memory  bandwidth.  When  that  advantage  is  lost,  the  GPUs  will  also  have  lost  their 
advantage  for  high  performance  scientific  computing. 

To  demonstrate  the  scalability  of  a  hybrid  parallel  LBM  implementation  with  CUDA 
and  MPI,  the  same  three-dimensional  Poiseuille  flow  was  tested  on  a  system  with  two  nodes 
and  two  GPUs  per  node.  The  results  are  presented  in  Figure  45.  In  this  instance,  the  results 
are  not  as  encouraging  as  that  obtained  using  CUDA  and  OpenMP  This  is  attributed  at 
least  in  part  to  the  fact  that  the  CUDA/OpenMP  case  was  able  to  take  advantage  of  highly 
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Figure  45:  Weak  scaling  using  CUDA  and  MPI  for  LBM  simulation  of  three-dimensional 
Poiseuille  flow. 


efficient  peer-to-peer  memory  transfers  to  exchange  boundary  data.  Future  revisions  to  the 
CUDA  API  are  expected  to  inter-operate  with  “CUDA-aware”  MPI  implementations  that 
will  allow,  at  least  from  a  programmer’s  perspective,  seamless  peer-to-peer  memory  trans¬ 
fers  among  GPU  devices  associated  with  different  MPI  processes.  It  is  expected  that  this 
feature  will  improve  the  scalability  of  CUDA/MPI  hybrid  parallel  programs. 
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V.  FLUID-STRUCTURE  INTERACTION  WITH  LBM 


The  principle  goal  of  this  research  is  to  investigate  the  use  of  LBM  for  FSI.  With 
a  working  collection  of  LBM  tools  to  model  a  variety  of  flow  conditions  in  both  two  and 
three  dimensions  it  is  necessary  to  couple  these  tools  with  the  requisite  structural  models  to 
analyze  FSI.  Following  a  brief  literature  review,  the  coupled  FSI  problem  will  be  explored 
in  turn  for  the  2D  and  3D  case.  The  chief  products  of  these  investigations  are: 

•  A  collection  of  example  problems  to  be  compared  with  previous  work  and 
established  benchmarks.  Each  example  problem  illustrates  qualitatively 
the  coupled  interaction  between  the  fluid  flow  and  a  linear  elastic  structure 
with  small  displacements. 

•  A  novel  algorithm  for  exploiting  task-level  parallelism  inherent  in  the  FSI 
problem  that  exploits  both  the  CPU  as  well  as  the  GPU  to  make  best  use 
of  computational  resources  and  achieve  high  performance. 

Despite  these  advances,  significant  work  remains  to  be  done  in  this  area  in  order  to 
allow  reliable  and  effective  FSI  over  a  range  of  interesting  problem  domains.  In  particular, 
geometric  and  material  nonlinearity  must  be  incorporated  into  the  material  model  in  order 
to  quantitatively  match  benchmark  data.  Additionally,  in  order  to  accommodate  the  large 
structural  displacements,  the  LBM  model  must  be  further  expanded  to  account  for  lattice 
points  transitioning  from  the  fluid  domain  to  that  subdomain  covered  by  the  solid  when  the 
structure  undergoes  large  displacements. 

A.  INTRODUCTION  AND  LITERATURE  REVIEW 

The  interaction  of  fluid  flow  with  elastic  structures  and  suspended  particles  are  com¬ 
monly  encountered  problems  in  many  practical  engineering  applications.  Use  of  the  LBM 
for  the  fluid  modeling  is  common,  nonetheless  to  the  best  the  author’s  knowledge  no  sub¬ 
stantial  work  has  been  done  to  apply  LBM  to  solve  the  equations  of  structural  dynamics. 
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Consequently,  several  methods  have  been  proposed  to  couple  the  LBM  with  more  tradi¬ 
tional  structural  dynamics  solvers  in  order  to  collectively  capture  both  the  fluid  and  struc¬ 
tural  dynamics  aspects  FSI  problems. 

Some  recent  authors,  such  as  [59],  [60]  and  [61],  have  used  the  immersed  boundary 
(IB)  method  along  with  LBM.  In  an  IB  method,  the  fluid  problem  is  solved  over  a  fixed 
Eulerian  grid  that  covers  the  entire  domain.  The  immersed  boundary  is  solved  on  a  moving 
Lagrangian  array  of  points  overlying  a  portion  of  the  Eulerian  grid.  The  force  that  the  fluid 
exerts  on  the  moving  boundary  is  projected  onto  the  Eagrangian  node  points  by  use  of  the 
Dirac  delta  function  as  described  in  [62].  This  solution  procedure  has  been  extensively 
used  for  studying  ESI  applications,  particularly  for  biological  flows  [63]. 

While  the  IB  method  has  been  popular  for  modeling  of  highly  flexible  structures, 
a  common  methodology  for  multibody  coupling  with  the  EBM  is  the  discrete  element 
method  (DEM).  This  method  combines  problems  addressed  by  the  coupled  IB  and  EBM 
solvers,  and  indeed,  use  the  same  methods  for  hydrodynamic  coupling  between  the  fluid 
and  solid,  but  also  account  for  the  solid  body-to-body  interactions  and  has  been  extensively 
used  for  particle  and  granular  flow  problems.  A  succinct  review  of  the  use  of  EBM  and 
DEM  is  provided  in  [64].  As  a  few  examples,  this  approach  has  found  use  in  investigation 
of  sedimentation  of  particles  in  fluid  at  low  Reynolds  numbers  [65],  particulate  flows  [66], 
and  particle  transport  in  turbulent  fluid  flows  [67]. 

The  EEM  has  a  long  history  of  use  for  problems  in  both  structural  and  fluid  dynam¬ 
ics.  As  can  be  expected,  the  EEM  as  also  been  applied  to  ESI  problems  where  the  EEM 
is  used  in  both  the  fluid  and  solid  domains  [68],  [69]  and  [70],  for  example.  In  this  work, 
the  advances  in  coupling  EBM  with  EEM  for  structural  dynamics  presented  in  [28]  will  be 
used  as  a  starting  point  for  further  ESI  studies. 

B.  FORCE  EVALUATION 

All  aforementioned  approaches  to  coupled  ESI  problems  share  the  issue  of  needing 
to  determine  how  forces  and  momentum  inputs  from  one  domain  are  to  be  transmitted  into 
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the  other.  For  monolithic  approaches,  such  as  using  FEM  both  for  the  fluid  and  structural 
domains,  the  transfer  of  this  data  may  be  a  natural  part  of  the  discretization  and  satisfaction 
of  continuity  equations.  For  the  coupled  LBM  and  FEM  approach  undertaken  in  this  work, 
it  is  necessary  to  obtain  the  force  that  the  fluid  imparts  upon  this  structure  at  the  fluid-solid 
interface.  Two  approaches  will  be  discussed  in  this  section: 

1.  Stress  Integration  Approach;  and 

2.  Momentum  Response  Approach 

1.  Stress  Integration  Approach 

This  method  was  introduced  by  He  and  Doolen  in  [71],  where  they  evaluated  the 
forces  on  a  cylinder  in  channel  flow  by  integrating  the  total  stresses  on  the  surface  of  the 
cylinder  by  evaluating  Equation  45, 


F 


n  ■ 


p\  +  pv  [  Vu  +  (Vu) 


dA 


(45) 


where  F  is  the  force  vector,  n  is  the  outward  facing  normal  of  a  the  solid  surface,  and  u  is 
the  kinematic  viscosity. 

The  first  term  of  Equation  45  is  easy  to  evaluate  within  the  EBM  framework  using 
the  simple  relation  of  Equation  13  from  Chapter  II.  The  second  term  in  Equation  45  is  the 
deviatoric  stress  for  incompressible  flow  as  given  in  Equation  46. 


Tij  =  pv  (^Vu  +  (Vu)^j  (46) 

Equation  46  can  be  evaluated  by  computing  the  macroscopic  velocity  throughout  the  do¬ 
main  using  Equation  12,  then  a  using  a  discrete  differencing  scheme  to  compute  the  spatial 
partial  derivatives  and  evaluate  with  given  values  of  viscosity.  This  methodology  does 
not  sit  well  with  the  general  theme  of  EBM  whereby  computations  should  be  performed 
locally  to  a  single  lattice  point.  In  keeping  with  the  general  theme  of  locality,  in  EBM  is 
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computed  using  the  non-equilibrium  portion  of  the  particle  density  distribution  function; 

=  fa  —  In  the  case  for  a  D2Q9  lattice,  at  each  lattice  point  this  is  done  using 
Equation  47 


Tij  =  (^1  -  ^  [f»  t)  -  (x,  f)]  X  ■  eJij 


(47) 


Once  the  integrand  for  Equation  45  is  computed,  executing  the  integral  can  be  done  numer¬ 
ically  with,  for  the  D2Q9  lattice  with  Equation  48. 


F  =  ^eQ,-  pi -f  pz/ ^Vu -f  (Vu 


0=1 


Sx 


(48) 


In  summary,  to  use  the  stress  integration  approach  for  evaluating  fluid  forces  on  a 
structure,  given  fluid  viscosity  and  the  current  set  of  density  distribution  functions  fa'- 


1 .  Compute  density  using  Equation  1 1 . 

2.  With  this  density,  compute  pressure  using  Equation  13. 

3.  Compute  macroscopic  velocity  using  Equation  12. 

4.  Compute  using  Equation  10. 

5.  Compute  local  deviatoric  stress  via  Equation  47. 

6.  Compute  force  from  surface  stress  with  Equation  48. 

2.  Momentum  Response  Approach 

Instead  of  the  stress  integration  method,  a  momentum  exchange  method,  developed 
by  Ladd  in  [72],  can  be  used  to  compute  the  fluid  force  on  closed  surfaces  suspended  in  the 
flow  field.  In  this  method,  the  total  force  acting  on  a  solid  body  is  obtained  by  Equation  49, 
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(49) 


Nd 

F  =  ^  ^  [/„(xb,  t)  +  fa  (xfe  +  eaSt,  t)] 

aWxi)  a=l 

where  Gq,  is  the  lattice  direction  opposite  of  Oq. 

One  disadvantage  of  the  momentum  response  as  shown  in  Equation  49  is  that  it 
makes  use  of  values  of  fa  from  neighboring  lattice  points.  This  non-locality  can  compli¬ 
cate  parallel  implementation.  For  problems  with  a  large  number  of  lattice  points  on  fluid/- 
solid  boundaries,  the  need  to  access  additional  values  of  fa  can  add  to  the  overall  memory 
bandwidth  demand  and  thus  degrade  performance  for  this  memory  bandwidth-bound  ap¬ 
plication. 

C.  COUPLING  PROCEDURE 

Once  the  relevant  forces  have  been  computed  on  the  fluid  domain,  they  must  be 
satisfactorily  transferred  to  the  discrete  representation  of  the  structural  domain  on  which 
they  will  be  imparted.  Respectively,  once  the  equations  of  motion  have  been  solved  on 
the  structural  domain,  the  inputs  relevant  to  the  coupled  fluid  problem  must  be  passed 
to  the  fluid  domain.  This  straight-forward  problem  can  become  complicated  due  to  the 
possibility  that  the  discrete  representation  of  the  fluid  and  solid  may  not  conform  exactly 
at  the  interface.  In  cases  where  the  meshes  do  not  conform,  some  form  of  interpolation 
will  be  required  in  order  to  map  forces  applied  in  one  domain  to  degrees  of  freedom  in  the 
other. 

For  this  work,  problems  were  restricted  to  those  where  the  fluid  and  structural  dis¬ 
cretizations  would  conform  at  domain  boundaries.  This  approach  greatly  simplifies  com¬ 
munication  of  information  between  domains  at  the  cost  of  severely  restricting  the  types  of 
problems  that  can  be  analyzed.  Specifically,  the  geometric  non-linearity  of  a  moving  solid 
mesh  covering  and  uncovering  fluid  lattice  points  is  not  addressed  within  the  scope  of  the 
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present  work;  data-flow  between  the  meshes  will  be  addressed  as  in  Equation  50. 


Pfluid(x,t)  -^Psolid(x,t) 


Usolid(x,t)  -)■  Ufluid(x,t) 


(50) 


Onee  the  data  has  been  transferred  between  the  fluid  and  solid  domain  meshes,  proeedures 
appropriate  for  eaeh  domain  are  applied  to  apply  those  boundary  eonditions  to  the  diserete 
model.  Future  implementations  will  address  this  shorteoming  and  implement  appropriate 
interpolation  and  extrapolation  sehemes  for  boundary  data  eommunieation. 

D.  FLUID-STRUCTURE  INTERACTION  IN  TWO  DIMENSIONS 


All  ESI  simulations  eondueted  in  two  dimensions  use  the  D2Q9  lattiee  for  the 
EBM  solution  to  the  fluid  domain.  For  all  struetural  eomponents  modeled  within  the  two- 
dimensional  ESI  problem,  Euler-Bemoulli  beam  elements  with  linear  elastie  eonstitutive 
models  are  used  for  the  diserete  struetural  model. 

1.  Structural  Model 

A  formulation  of  the  Euler-Bernoulli  beam  type  is  presented  in  [73];  a  sehematie  is 
given  in  Figure  46.  There  are  two  degrees  of  freedom  per  node;  displaeement  v  and  rotation 
9.  The  beam  materials  to  be  used  are  listed  in  Table  6. 


Aj  =0  =  / 


Figure  46:  Euler-Bemoulli  Beam. 
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Material 

P  (a 

E  {^)  X  106 

steel 

7,800 

210,000 

cork 

180 

32 

PVC 

1,400 

1,500 

Table  6:  Selected  structural  material  properties  used  for  FSI  simulations. 


Eluid 

p  (5) 

z/  f  — )  X  10-6 

\  sec  J 

ethyl  alcohol 

790 

1.4 

vegetable  oil 

920 

76.1 

water 

1000 

1.14 

blood 

1035 

4 

glycerin 

1260 

1127 

Mercury 

13594 

0.0114 

Table  7:  Selected  fluid  properties  for  FSI  simulations. 


2.  Fluid  Models 

The  FSI  simulations  performed  for  this  work  were  repeated  for  a  selected  combina¬ 
tion  of  fluid  and  structural  properties.  The  goal  is  to  obtain  insight  as  to  how  the  combined 
systems  would  behave  with  various  combinations  of  either  highly  viscous  and  dense  fluids 
like  honey  as  compared  to  fluids  of  very  low  viscosity  such  as  liquid  Mercury.  Though  this 
leads  to  an  admittedly  qualitative  analysis,  this  author  asserts  that  the  results  can  still  be 
useful  for  building  intuition  and  confirming  the  overall  efficacy  of  the  software  tools.  The 
relevant  properties  of  the  fluids  used  in  these  simulations  are  summarized  in  Table  7. 

3.  Converging-Diverging  Channel 

This  problem,  inspired  by  similar  work  reported  in  [74],  is  a  2D  FSI  problem  of 
flow  in  a  converging  and  diverging  duct.  A  schematic  of  the  problem  domain  is  shown  in 
Figure  47.  No-slip  boundary  conditions  are  used  on  the  rigid  portions  of  the  upper  and 
lower  boundary.  In  the  LBM  problem,  the  structure  is  modeled  as  a  moving  solid  domain 
and  in  the  FEM  structural  dynamics  problem  it  is  modeled  as  a  Euler-Bernoulli  beam  with 
clamped  boundary  conditions  on  both  ends. 
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Figure  47 :  Schematic  of  2D  converging  and  diverging  duct. 

A  series  of  simulations  were  run  with  glycerin  as  the  baseline  fluid;  chosen  for  its 
comparative  high  density  and  viscosity.  The  flexible  structure  is  composed  of  cork;  chosen 
for  its  comparative  low  density  and  low  modulus  of  elasticity.  The  flow  Reynolds  number 
is  set  at  5  based  on  inlet  width.  Results  for  displacement,  velocity  and  acceleration  at  the 
beam  midpoint  are  given  in  Figure  48  for  the  first  72  seconds  of  simulation  time.  Note  that 
no  damping  was  included  in  the  material  model,  but  that  nonetheless  the  beam  motion  is 
gradually  damped  to  a  constant  downward  displacement  as  expected.  The  relative  phases 
of  displacement,  velocity  and  acceleration  are  displayed  in  Figure  49. 

It  is  possible  to  gain  intuitive  insight  into  the  important  FSI  parameters  even  with 
this  comparatively  crude  implementation.  Consider  the  case  where  the  viscosity  of  the 
fluid  is  changed.  In  Figure  50,  a  comparison  can  be  made  for  beam  displacement  where 
the  fluid  viscosity  is  varied.  It  is  clear  that  fluid  viscosity  is  an  important  parameter  in 
ultimate  damping  of  beam  oscillations  where  more  viscous  liquids  offer  more  resistance  to 
structural  velocities.  We  do  a  similar  exercise  with  beam  elasticity  by  varying  the  elastic 
modulus.  Results  for  this  are  given  in  Figure  51.  The  more  pliant  material  is  damped  by 
the  fluid  more  quickly  than  the  relatively  stiff  beam. 
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Displacement  vs.  Time 
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Figure  48:  Converging  and  diverging  duet  displaeement,  veloeity  and  aeeeleration  at  beam 
midpoint.  Re  =  5,  glyeerin  with  eork  beam. 
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Figure  49:  Converging  and  diverging  duct  with  combined  beam  response.  Re=5,  glycerin 
with  cork  beam. 
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Displacement  vs.  Time 


Time  (sec) 


Displacement  vs.  Time 
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Figure  50:  Converging  and  diverging  duet  with  varying  fluid  viseosity.  Starting  from  top 
left,  fluid  viseosity  is  z/giycedn 
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Figure  51:  Converging  and  diverging  duet  with  varying  beam  elastie  modulus.  From  left  to 
right  elastie  modulus  is  and  F^cork- 
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4.  Lid-Driven  Cavity 

The  lid-driven  eavity  FSI  simulation  uses  the  geometry  illustrated  sehematieally  in 
Figure  52.  The  top  wall  will  be  modeled  using  the  Regularized  boundary  eondition  for 
imposed  veloeity,  the  bottom  wall  will  be  modeled  as  an  elastie  moving  boundary  and  the 
left  and  right  walls  will  be  modeled  as  no-slip  boundaries.  The  elastie  bottom  wall  will  use 
the  Euler-Bemoulli  beam  with  elamped  boundary  eonditions  on  both  ends. 


Moving  Wall 

- ► 


No  Slip  Wall 


Elastic  Boundary 


No  Slip  Wall 


Figure  52:  Sehematie  diagram  of  lid-driven  eavity  FSI  problem  geometry. 


The  results  for  the  simulation  are  presented  in  Figure  53.  The  streamlines  are  shown 
as  well  as  a  veetor  representation  of  final  displaeement.  A  plot  of  the  final  elastie  boundary 
displaeement  is  provided  in  Figure  54.  The  fluid  for  the  simulation  was  glyeerin  and  the 
material  used  for  the  elastie  boundary  was  PVC  with  material  properties  as  listed  in  Table 
6.  In  this  instanee,  Rayleigh  damping  was  applied  to  the  material  model  to  limit  spurious 
oseillatory  behavior  of  the  strueture.  This  damping  was  applied  by  seleeting  parameters  a 
and  in  Equation  5 1  where  for  this  equation  M  is  the  struetural  mass  matrix  and  K  is  the 
struetural  stiffness  matrix.  These  parameters  are  set  so  as  to  generate  a  damping  matrix  C 
that  generates  aeeeptable  beam  dynamie  behavior. 
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C  =  aM  +  /3K 


(51) 


Figure  53:  Results  for  two-dimensional  lid-driven  eavity. 
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Figure  54:  Final  bottom  displacement. 


5.  Cylinder  with  Fin  Benchmark 

For  this  example  simulation,  the  two-dimensional  variation  of  the  cylinder  with 
elastic  fin  benchmark  proposed  in  [75] .  The  inlet  boundary  condition  is  a  parabolic  velocity 
profile,  the  outlet  is  modeled  as  a  constant  pressure  boundary  condition.  The  top  and  bottom 
of  the  domain  is  modeled  as  no-slip  boundaries.  The  flow  Reynolds  number,  based  on 
cylinder  diameter,  is  set  to  200.  In  this  flow  condition,  periodic  vortex  shedding  is  expected 
from  the  top  and  bottom  of  the  cylinder. 


Figure  55:  Cylinder  with  elastic  fin  benchmark 
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The  resulting  pressure  fluetuations  from  the  vortex  shedding  imposes  a  periodie 
exeitation  on  the  elastie  fin.  The  elastie  fin  is  modeled  as  an  Euler-Bernoulli  beam.  The 
displaeement,  veloeity  and  aeeeleration  at  the  beam  tip  is  presented  in  Figure  56. 


Figure  56:  Results  for  two-dimensional  eylinder  with  elastie  trailing  fin  at  Re=200. 
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E.  FLUID-STRUCTURE  INTERACTION  IN  THREE  DIMENSIONS 

In  this  section,  the  cylinder  with  elastic  fin  benchmark  presented  previously  is  re¬ 
peated  in  three  dimensions.  The  fluid  is  represented  using  a  D3Q19  lattice  along  with  a 
MRT  collision  operator;  both  choices  made  in  the  interest  of  maximizing  fluid  simulation 
stability  while  minimizing  the  required  number  of  lattice  points.  The  structural  model  is 
composed  of  a  single  Mindlin-Reissner  plate.  A  detailed  formulation  of  this  plate  is  pre¬ 
sented  in  [73]. 

The  inlet  boundary  condition  is  a  parabolic  velocity  profile  and  the  outlet  boundary 
condition  is  constant  pressure.  No-slip  boundaries  are  established  on  the  upper  and  lower 
boundaries  of  the  domain  as  well  as  on  the  surface  of  the  rigid  cylinder.  The  inlet  velocity 
is  modeled  with  regularized  boundary  conditions  as  is  the  constant  pressure  outlet.  The 
elastic  fin  is  modeled  with  clamped  boundary  conditions  at  the  attachment  point  to  the 
cylinder  with  free  boundary  conditions  on  the  free  end.  The  surface  of  the  elastic  fin  is 
represented  in  the  fluid  domain  as  a  moving  boundary. 

In  order  to  ensure  representative  results,  an  initialization  phase  is  performed  where 
the  LBM  system  is  iterated  until  the  expected  time-periodic  flow  condition  is  established. 
This  condition  is  confirmed  by  sampling  the  vertical  velocity  component  in  the  channel 
center-line  downstream  of  the  cylinder/fin  obstacle;  a  stable  periodic  oscillation  indicates 
the  system  is  ready  to  initiate  FSI. 

Results  are  shown  in  Figure  57.  The  oscillation  frequency  of  the  beam  tip  is  3.8 
sec“^.  For  the  geometry  and  fluid  properties  used  in  this  simulation  this  corresponds  to 
a  Strouhal  number  of  0.19  which  matches  well  with  experimentally  measured  values  for 
vortex  shedding  for  flow  of  a  cyinder  with  Reynolds  number  equal  to  200. 

E  HETEROGENEOUS  PARALLEL  IMPLEMENTATION 

The  computational  requirements  for  three-dimensional  two-way  FSI  is  significant 
with  both  the  structural  and  fluid  models  requiring  a  large  amount  of  memory  and  processor 
resources.  The  GPU  used  for  this  research  had  sufficient  memory  resources  to  handle 
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Displacement  (m) 


Figure  57:  Displacement,  velocity  and  acceleration  for  cylinder  with  elastic  fin.  Re=200. 
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only  one  of  these  problems.  Therefore  all  of  the  FSI  examples  presented  in  this  work 
use  the  strategy  of  employing  the  GPU  for  the  LBM  simulation  while  relying  on  the  CPU 
for  the  struetural  dynamies  simulation.  The  rationale  for  this  ehoiee  is  that  in  eaeh  ease 
the  struetural  model  is  dimensionally  redueed  relative  to  the  fluid  model;  two-dimensional 
fluid  models  interaet  with  Euler-Bemoulli  beam  struetural  models  that  are  logieally  one¬ 
dimensional.  Similarly,  for  the  three-dimensional  FSI  example,  the  three-dimensional  fluid 
model  is  eoupled  with  a  struetural  model  that  uses  a  plate  that  is  logieally  two-dimensional. 
With  the  GPU  being  the  most  eapable  eomputational  deviee,  it  was  assigned  to  the  largest 
portion  of  the  work.  The  task-level  deeomposition  for  the  FSI  simulation  between  the  GPU 
and  CPU  is  illustrated  in  Figure  58. 


CPU  GPU  CPU 


CPU  GPU  CPU 


Figure  58:  Illustration  of  task- level  deeomposition  in  parallel  implementation  of  FSI  prob¬ 
lem.  In  the  lower  figure,  a  further  level  of  task-level  parallelism  is  exploited  by  overlapping 
the  struetural  dynamies  eomputation  on  the  CPU  with  LBM  ealeulations  on  domain  areas 
remote  from  the  elastie  strueture  on  the  GPU. 


Two  implementations  were  prepared  for  this  work.  In  the  first  ease,  the  fluid  domain 
was  eonsidered  as  a  whole  with  the  GPU  performing  an  LBM  time  step  on  the  entire  do¬ 
main  before  passing  pressure  boundary  eonditions  to  the  CPU  for  the  struetural  simulation. 
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When  the  CPU  eompleted  the  struetural  time-step,  veloeity  boundary  eonditions  are  passed 
baek  to  the  GPU.  In  the  seeond  ease,  the  fluid  domain  is  partitioned  as  illustrated  in  Figure 
59.  With  this  deeomposed  domain,  the  GPU  would  exeeute  an  LBM  time  step  in  a  region 
of  the  domain  immediately  surrounding  the  elastie  strueture.  When  the  time  step  is  eom- 
plete,  updated  fluid  forees  are  transferred  to  the  CPU  memory  so  that  the  struetural  time 
step  ean  be  exeeuted.  Coneurrently,  immediately  upon  passing  its  boundary  eonditions,  the 
GPU  eommenees  exeeution  of  the  LBM  time  step  on  the  remaining  fluid  domain.  In  this 
way,  the  overall  eomputation  time  is  redueed  by  exploiting  this  extra  level  of  eoneurreney. 
The  performanee  improvement  that  is  realized  by  using  this  strategy  is  highly  variable;  de¬ 
pending  both  on  the  problem  as  well  as  the  relative  performanee  eharaeteristies  of  the  GPU 
and  CPU  as  well  as  the  implementation  details  of  both  the  LBM  and  struetural  dynamies 
simulation  model.  Nonetheless,  for  this  researeh,  using  an  Intel  i5  quad-eore  CPU  running 
at  3.6  GHz  eombined  with  a  NVIDA  GTX-580,  a  performanee  improvement  of  23%  was 
obtained  using  this  proeedure  as  shown  in  Table  8. 


Pin  Envelope 


Domain  Remote 
from  Fin 


Figure  59:  Deeomposition  of  FSI  problem  domain  for  task-level  heterogeneous  parallelism. 
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Method 

Average  time  per  time-step  (sec) 

Percent  Speedup 

Non-Overlapped 

0.0305 

- 

Overlapped 

0.0234 

23% 

Table  8:  Performance  improvement  from  using  overlapped  execution  depicted  in  lower 
half  of  Figure  58  versus  the  non-overlapped  scheme  in  the  top  half  of  Figure  58.  The  lattice 
was  22  X  337  x  526  D3Q19  using  MRT  collision  operator.  The  structural  model  was  a 
sheer-deformable  plate  with  2883  degrees  of  freedom. 
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VI.  HYBRID  LATTICE  BOLTZMANN  METHOD 


A.  INTRODUCTION  AND  LITERATURE  REVIEW 

As  a  fluid  flow  solver,  the  LBM  is  particularly  effective  in  scenarios  where  con¬ 
ventional  methods  like  FDM,  CVM  and  FEM  have  difficulty.  Cases  with  highly  complex 
geometry  such  as  are  encountered  in  porous  flow  problems  [76]  where  it  is  difficult  or 
impractical  to  fully  discretize  the  domain  are  particularly  suitable  to  LBM.  Simulations 
involving  multi-component  or  multi-phase  flows  where  it  is  very  hard  to  accurately  model 
fluid  interfaces  and  to  capture  fluid  phase  and  component  interactions  are  also  examples 
where  LBM  has  found  good  use  [77]-[79]. 

Despite  these  advantages,  the  LBM  in  its  classical  formulation  (CLBM)  has  chal¬ 
lenges  of  its  own.  In  order  to  describe  the  transport  of  particle  velocity  distributions  through 
the  domain,  the  CLBM  calls  for  a  coupled  space  and  time  domain  discretization  that  is  re¬ 
alized  in  the  form  of  a  uniform  and  regular  set  of  lattice  points  across  which  particles 
“stream.”  Lor  practical  problems  which  often  contain  objects  with  curved  or  complex  sur¬ 
faces  it  may  be  necessary  to  use  a  highly  refined  lattice  to  describe  the  geometry.  This  often 
leads  to  non-complex  regions  of  the  domain  populated  with  a  needlessly  dense  lattice. 

Several  methods  have  been  developed  to  alleviate  this  issue  and  provide  greater 
geometric  flexibility  and  adaptability  to  the  lattice  mesh.  These  approaches  include  de¬ 
velopment  of  a  finite  volume  formulation  of  the  LBM  [80]  as  well  as  finite  difference 
formulations  [81],  [82],  multigrid  lattice  methods  [83]  finite  element  LBM  (LELBM)  [84], 
[74],  and  spectral  element  discontinuous  Galerkin  LBM  [85]-[87]  among  others.  These 
methods  obtain  this  benefit  of  geometric  flexibility  at  the  expense  of  a  relative  increase  in 
the  computational  effort  required  per  lattice  point  in  the  problem  domain. 

In  this  dissertation,  we  describe  a  hybrid  LBM  (HLBM)  where  the  LELBM  derived 
in  [74]  is  combined  with  the  CLBM  over  one  or  more  sub-domains.  This  new  method  ex¬ 
ploits  the  geometric  flexibility  of  the  LELBM  to  mesh  complex  surfaces  with  fewer  lattice 
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points  than  the  CLBM  while  leveraging  the  CLBM  to  model  geometrically  simple  regions 
of  the  problem  domain  with  greater  computational  efficiency.  The  result  is  a  method  that 
benefits  from  both  approaches. 

In  the  following  section,  the  derivation  of  FELBM  is  outlined.  Next,  the  coupling 
procedure  of  the  HLBM  is  illustrated  followed  by  numerical  results  and  conclusions. 

B.  FINITE  ELEMENT  LBM 

The  FELBM  borrows  almost  everything  from  the  CLBM  theory.  The  formulation 
and  implementation  can  be  summarized  as  very  similar  to  CLBM  with  the  exception  of  the 
“streaming”  procedure  at  the  end  of  each  time  step  which  the  FELBM  does  not  do.  Instead, 
an  equivalent  operator  is  developed  to  “advect”  the  particle  density  distribution  functions 
along  each  of  the  characteristic  lattice  directions.  This  feature  is  what  allows  the  FELBM 
to  use  a  non-uniform  grid  and  the  interpolative  nature  of  the  advection  operator  is  common 
among  non-classical  LBM  procedures. 

Starting  with  the  discrete  Boltzmann  equation,  we  expand  /„  within  a  space  of 
interpolating  polynomials  defined  on  finite  elements: 

n 

/a  =  E^T=[-ffl{/«}  (52) 

i=l 

where  n  is  the  number  of  nodes  in  an  element  as  well  as  the  number  of  interpolating  poly¬ 
nomials  H\ 

This  relation  is  substituted  into  Equation  8  using  the  BGK  collision  operator  of 
Equation  9  and  the  result  is  re-expressed  in  residual  form: 

fl  =  |//|^  +  e„  .  IVH]{U}  +  Im  ({/„}  -  {/„})  .  (53) 

Using  the  Galerkin  formulation  of  the  Method  of  Weighted  Residuals,  the  given  residual 
is  multiplied  by  the  interpolating  functions  if*  and  integrated  over  all  elemental  domains. 
This  results  in  the  following  set  of  linear  equations: 
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+  1A1{F„}  +  |C1{F„}  -  |C1{A}  =  0 


(54) 


where,  with  being  the  elemental  domain  for  the  e-th  finite  element,  the  linear  operators 
are  defined  as: 


[M]  =  V  /  [Hf[H]d^l  (55) 

e 

[K]=y^  I  [HY{e^  ■  [VH]dn  (56) 

e 

[C]  =  y2f  kHf[H]d^  (57) 

{F.}  =  (58) 

e 

This  is  now  a  first-order  differential  equation  in  time  whieh  deseribes  the  adveetion  of 
partiele  distribution  funetions  along  the  eharaeteristie  lattiee  direetions. 

Herein  lies  the  diffieulty  of  the  FELBM.  While  the  CLBM  is  able  to  aeeomplish  the 
partiele  “adveetion”  by  streaming  between  neighboring  lattiee  points  whieh  ean  be  effeeted 
by  simple  eopy  and  assignment  operations,  the  FELBM  must  aeeomplish  the  same  task 
by  evaluating  this  diserete  linear  adveetion  equation.  Whatever  time  integration  method 
is  used,  it  is  obvious  that  it  will  always  be  more  demanding  than  the  CLBM  streaming 
and,  unlike  the  logieal  streaming  method,  is  subjeet  to  the  usual  dissipative  and  dispersive 
approximation  errors  assoeiated  with  the  numerieal  time-integration  method  employed. 

Numerieal  dissipation  and  dispersion  ean  be  mitigated  by  redueing  time-step  size 
and  refining  the  spatial  mesh,  but  both  of  those  methods  require  more  eomputational  work. 
In  a  similar  way,  dipsersive  errors  ean  be  mitigated  by  using  higher-level  time  integration 
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schemes  such  as  explicit  multi-stage  Runge  Kutta  methods,  or  implicit  methods  such  as 
Backward  Euler,  Trapezoidal  or  Crank-Nicolson  techniques  but  these  methods  also  require 
additional  work. 

The  hybrid  method  introduced  in  the  next  section  seeks  to  reduce  the  additional 
computational  workload  associated  by  employing  the  FELBM  over  only  those  sub-domains 
where  the  geometric  flexibility  is  needed.  The  remainder  of  the  domain  is  modeled  with 
CEBM  and  a  coupling  procedure  is  introduced  to  allow  the  solution  to  proceed  concurrently 
on  all  sub-domains. 

C.  HYBRID  CLBM/FELBM  METHODOLOGY 

In  order  to  mitigate  the  computational  demands  of  the  FEEBM  while  retaining  the 
ability  to  model  a  domain  with  complex  and  irregular  shapes  without  an  unnecessarily 
dense  lattice,  the  hybrid  CEBM/FEEBM  (HEBM)  methodology  is  developed. 

All  of  the  theoretical  machinery  from  the  CEBM  and  EEEBM  formulations  are 
preserved  and  the  logical  sequence  of  computations  is  maintained  on  each  individual  sub- 
domain.  A  typical  HEBM  time-step  is  portrayed  schematically  in  Figure  60. 

To  couple  disjoint  sub-domains,  an  interface  layer  is  provided.  Computationally, 
the  streaming  process  of  the  CEBM  domain  and  the  advection  process  of  the  FEEBM 
domain  can  be  executed  concurrently  with  each  sub-domain  retaining  a  “halo”  of  depth 
1  into  the  adjoining  sub-domain.  Within  each  sub-domain,  the  outermost  layer  of  lattice 
points  represents  the  halo  and  as  shown  in  Figure  61  and  Figure  62. 

While  this  coupling  scheme  is  conceptually  very  simple,  great  care  must  be  taken 
with  the  time  and  space  integration  methods  used  for  advecting  particle  density  data  on  the 
FEEBM  sub-domain.  First-order  time  integration  schemes  tend  to  have  too  much  dissipa¬ 
tion  error  while  second  order  schemes  suffer  from  dispersion  errors;  both  of  which  prop¬ 
agate  onto  the  CEBM  sub-domain  and  impacts  solution  quality  and  stability  everywhere. 
For  the  results  discussed  in  this  paper,  simple  bi-linear  elements  are  used  for  the  spatial 
discretization  and  a  4-stage  3rd  order  explicit  time  integration  method  shown  in  Equation 
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Figure  60:  Schematic  of  Hybrid  LBM  time  step.  Methodology  differs  only  in  implementa¬ 
tion  of  the  particle  streaming  phase. 


CLBM  Interior  □  CLBM  Halo 


FELBM  Interior 


I  I  FELBM  Halo 


Figure  61:  Schematic  Hybrid  Lattice  on  regular  domain.  Assignment  following  streaming 
in  the  CLBM  domain  and  advection  in  the  FELBM  domain  is  only  made  to  the  interior 
of  each  respective  sub-domain.  Data  drawn  from  the  lattice  points  on  the  halo  facilitates 
communication  between  each  sub-domain. 
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Hybrid  Test  Patch 
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Figure  62:  Interface  region  for  CLBM  and  FELBM  domains  on  a  uniform  mesh.  Lattice 
points  with  both  the  asterisk  and  circle  belong  to  the  interface. 


59  is  used  for  temporal  integration.  This  method  effectively  controls  both  dissipation  and 
dispersion  errors  during  the  advection  process  and  allows  a  single  FELBM  advection  time 
step  for  every  CLBM  streaming  step.  The  CLBM  sub-domain  undergoes  the  classical  LBM 
streaming  to  adjacent  neighbors  restricted  to  only  destination  lattice  points  that  lay  within 
the  CLBM  interior  as  indicated  in  Equation  60. 


F{U)  =  LEM  advection  operator  on  FELBM  sub-domain  elements 

f/""  fOuC  I  FELBM  sub-domain 

^(1)  = 

JJ(2)  ^  u(l)  ^  ^ 

^(3)  _  2^n  ^  1^(2)  ^  ^ 

Ijn+l  ^  jj{3)  ^  ^ 


Un+1 


FELBM  interior 


(59) 
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fOut  I CLBM  sub-domain  Streaming  fin  I CLBM  interior  (60) 

Geometrically  simple  portions  of  the  domain  are  discretized  with  a  uniform,  regular 
lattice  for  the  CLBM.  Sub-domains  containing  complex  or  irregular  shapes  are  identified 
and  discretized  with  the  FELBM.  For  example,  in  simulations  such  as  those  requiring  fluid- 
structure  interaction,  in  the  case  of  flow  past  a  heat  exchanger  tube  bundle,  it  would  suffice 
to  employ  the  FELBM  only  in  a  region  around  the  actual  tubes.  This  area  could  employ  a 
mesh  with  isoparametric  elements  to  efficiently  describe  the  shape  of  the  tube  and  be  used 
with  the  FELBM,  while  the  remainder  of  the  domain  could  use  a  uniform,  regular  lattice 
and  the  CLBM.  The  resulting  HLBM  could  accurately  capture  the  flow  properties  while 
reducing  the  total  number  of  lattice  points  required  significantly. 

D.  NUMERICAL  RESULTS  AND  DISCUSSION 

In  all  numerical  examples  provided,  four-node  linear  isoparametric  elements  are 
used  for  FELBM  analysis  as  well  as  in  the  FELBM  sub-domains  of  a  hybrid  method.  For 
CLBM  analysis  and  in  the  associated  sub-domains  of  hybrid  models,  the  D2Q9  lattice  is 
used  with  the  single  relaxation  time  BGK  collision  operator.  The  linear  advection  equation 
on  the  FELBM  domain  is  solved  using  a  four-stage  third  order  Strong  Stability  Preserving 
Runge-Kutta  method.  This  more  robust  method  was  essential  to  minimize  the  impact  of 
diffusive  errors  arising  from  the  advection  operator  implemented  with  bi-linear  finite  ele¬ 
ment  methods.  Numerous  alternative  integration  schemes  are  possible  however  if  advective 
sub-steps  are  allowed  on  the  FELBM  sub-domain. 

As  the  first  example,  a  simple  Poiseuille  flow  problem  is  simulated  using  the  HLBM 
with  a  simple  20x50  lattice-point  FELBM  patch  embedded  in  a  200x50  lattice  point  CLBM 
domain.  The  goal  of  this  test  is  to  prove  the  effectiveness  of  the  coupling  procedure.  The 
simulation  in  all  cases  show  good  agreement  with  the  well-known  exact  solution  as  shown 
in  Figure  63. 
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Figure  63:  Mid-channel  normalized  velocity  profile  for  Poiseuille  flow  using  CLBM, 
FELBM  and  HLBM. 

The  next  numerical  test  is  channel  flow  with  a  circular  obstacle.  The  obstacle  size 
and  boundary  conditions  are  selected  such  that  a  flow  condition  corresponding  to  Reynolds 
number  of  5  is  achieved.  The  CLBM,  FELBM  and  HLBM  methods  are  employed  to  simu¬ 
late  the  fluid  flow,  but  a  regular,  uniform  lattice  is  maintained  in  all  cases.  The  obstruction 
is  placed  at  ^  where  L  is  the  length  of  the  channel. 

The  last  numerical  test  employs  a  mixed  mesh  around  the  same  circular  obstacle. 
A  schematic  of  the  mesh  used  in  the  vicinity  of  the  obstacle  is  shown  in  Figure  64.  As 
indicated  by  the  contour  plots  of  both  the  CLBM  and  HLBM  simulations,  both  schemes 
give  results  that  are  in  good  agreement  throughout  the  computational  domain. 
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Figure  64:  Hybrid  lattice  mesh  around  a  circular  obstacle.  Lattice  points  with  asterisk  are 
in  the  CLBM  sub-domain,  those  circled  are  in  the  FELBM  sub-domain.  Those  with  both 
markings  are  members  of  the  interface  halo  of  the  two  regions. 
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Velocity  Contour  Plot  (CLBM) 


Figure  65:  Normalized  velocity  contour  plot  for  fluid  flow  around  circular  obstacle  at  Reynolds  number  =  5  using  CLBM 


Velocity  Contour  Rot  (Hybrid) 


Figure  66:  Normalized  velocity  contour  plot  for  fluid  flow  around  circular  obstacle  at  Reynolds  number  =  5  using  Hybrid  LBM. 


Method 

Relative  Execution  time  per  Lattice  Point 

CLBM 

1.0 

FELBM 

7.9 

Table  9:  Performance  comparison  of  CLBM  and  FELBM. 


Additionally,  the  velocity  profile  is  compared  for  all  three  methods  at  different  chan¬ 
nel  positions.  The  performance  of  each  method  is  implementation  dependent,  but  for  the 


Figure  67:  Normalized  velocity  profile  at  30  percent  channel  length,  Reynolds  number  =  5. 

numerical  studies  done  in  support  of  this  research,  performance  results  are  listed  in  Table 
9. 

Roughly  speaking,  the  performance  of  CLBM  and  FELBM  are  independent  of 
problem  size.  Consequently  a  simulation  employing  HLBM  is  expected  in  general  to  ex¬ 
hibit  performance  characteristics  intermediate  between  CLBM  and  FELBM;  problems  with 
relatively  small  FELBM  sub-domains  will  result  in  execution  times  correspondingly  closer 
to  that  of  CLBM  while  the  opposite  is  true  for  larger  FELBM  sub-domains. 
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Figure  68:  Normalized  veloeity  profile  at  60  pereent  ehannel  length,  Reynolds  number  =  5. 

Problems  of  praetieal  interest  introduee  non-trivial  eomplexities  into  this  perfor- 
manee  analysis.  Applieations  using  HLBM  as  an  alternative  to  CLBM  are  expeeted  to 
utilize  signifieantly  fewer  lattiee  points  and  thus  require  signifieantly  less  time.  In  addi¬ 
tion  to  redueing  required  run-time  for  a  given  simulation,  this  feature  will  allow  for  a  more 
detailed  simulations  to  be  eondueted  on  a  workstation-size  eomputer. 

As  an  illustration  of  the  potential  benefit  for  eurved  or  irregular  boundaries,  eonsider 
the  diseretization  of  the  surfaee  of  the  eireular  obstruetion  eonsidered  in  this  dissertation  in 
Figure  69.  The  HLBM  eode  used  for  this  analysis  required  10,210  lattiee  points  for  both 
sub-domains.  Comparably  realistie  depletion  of  the  eurved  boundary  requires  signifieantly 
more  lattiee  points  in  a  uniform,  regular  lattiee  over  the  entire  domain. 
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Figure  69:  Schematic  Hybrid  Lattice  on  regular  domain.  Assignment  following  streaming 
in  the  CLBM  domain  and  advection  in  the  FELBM  domain  is  only  made  to  the  interior 
of  each  respective  sub-domain.  Data  drawn  from  the  lattice  points  on  the  halo  facilitates 
communication  between  each  sub-domain. 
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VII.  LBM  FOR  MULTI-COMPONENT  FLUIDS 


The  results  and  associated  discussion  presented  thus  far  in  this  work  have  focused 
solely  on  single-phase,  single-component  fluid  models.  One  of  the  strengths  of  LBM  lies 
in  its  natural  amenability  to  multicomponent  fluids.  In  this  chapter,  the  main  features  of 
the  leading  multi-component  LBM  fluid  models  will  be  outlined  and  example  applications 
will  be  presented  for  two-dimensional  single-phase  multi-component  flows. 

A.  MULTI-COMPONENT  FLUID  MODELS 

There  are  four  main  multi-component  fluid  models  in  LBM  theory.  The  color- 
fluid  model  [88],  the  inter-particle-potential  model  [89],  the  free-energy  model  [90]  and 
the  mean-field  theory  model  [91].  The  methods  can  all  be  compared  based  on  the  way  in 
which  the  surface  tension  of  the  component  interface  is  taken  into  account  in  the  evolution 
of  the  particle  distribution  functions  and  how  the  location  of  this  interface  is  determined. 
Excellent  surveys  of  multi-component  fluid  flows  are  provided  in  [14] -[16].  In  this  work, 
the  inter-particle  potential  model  is  used,  so  this  method  will  be  discussed  in  greater  detail; 
the  other  methods  will  only  be  briefly  reviewed. 

1.  Color-Fluid  Model 

The  color  fluid  model  was  introduced  with  the  publication  of  [88]  to  allow  the  sim¬ 
ulation  of  immiscible  binary  fluids  in  two  dimensions.  It  is  based  on  the  two-component 
cellular-automata  model  introduced  in  [92]  and  modified  for  use  with  LBM.  The  method 
is  referred  to  as  the  color-fluid  model  due  to  the  convention  of  referring  to  the  binary  fluid 
mixture  in  terms  of  “red”  particles  and  “blue”  particles.  The  LBM  is  carried  out  for  each 
fluid  species  and  the  surface-tension  effect  on  particle  distributions  is  emulated  with  an 
additional  perturbation  term  appended  to  the  collision  operator.  This  term,  in  combination 
with  a  “recoloring”  step — a  correction  based  on  the  local  color  gradient  that  forces  to  shift 
to  a  direction  leading  to  other  like-colored  particles — locally  places  the  interface  as  well 
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as  implements  the  surface-tension  effect  with  the  momentum  of  each  particle  distribution. 
This  method  is  frequently  criticized  in  the  literature  for  the  “artificial”  recoloring  process 
([16],  [93])  though,  as  will  be  seen,  each  of  the  multi-component  models  have  some  heuris¬ 
tic  elements  that  can  be  subjected  to  the  same  criticism;  the  resulting  spurious  velocities 
exhibited  in  the  vicinity  of  fluid  interfaces  is  common.  More  importantly,  the  recoloring 
step  is  executed  based  on  a  costly  and  time-consuming  calculation  of  local  color  gradients 
that  requires  considerably  more  information  sharing  between  neighboring  particles  than  for 
other  methods. 

2.  Free-Energy  Model 

The  free-energy  model  proposed  in  [90]  takes  a  different  approach.  Instead  of  main¬ 
taining  density  distributions  for  each  phase,  a  single  density  function  p  is  used  along  with  a 
density  difference  Ap  as  the  simulation  parameters.  Despite  the  terminology,  which  lends 
one  to  believe  that  the  method  was  intended  mainly  for  single-component  multi-phase  flow, 
the  method  was  originally  introduced  to  model  phase  separation  in  non-ideal  one-  and  two- 
component  fluids.  The  free-energy  model  gets  its  name  through  the  use  of  the  so-called 
Cahn-Hilliard’s  approach  for  non-equilibrium  thermodynamics  [94].  In  this  approach,  the 
form  of  the  pressure  tensor  is  defined  based  on  a  non-local  pressure  and  a  parametrized 
van  der  Waals  equation  of  state.  This  pressure  tensor  is  added  to  an  expanded  equilibrium 
distribution  function  which  produces  the  desired  interfacial  effect.  This  approach  has  been 
used  to  simulate  Rayleigh-Taylor  instability  [95],  bubble  motion  [96]  and  simulation  of 
spontaneous  emulsification  of  liquid  droplets  in  oil- water-surfactant  systems  [97]. 

3.  Mean-Field  Theory  Model 

The  mean  field  theory  was  introduced  in  [91]  for  non-ideal  gas  flow.  In  this  method 
two  distribution  functions  are  used.  The  first  distribution  function  is  used  to  calculate  the 
pressure  and  velocity  fields  of  an  incompressible  liquid.  The  second  is  an  index  function 
that  is  used  to  locate  the  interface.  The  model  is  so-named  because  the  inter-particle  in¬ 
teractions  are  treated  using  a  mean-field  approximation  in  the  same  way  that  the  Coulomb 
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interaction  among  charged  particles  of  a  plasma  is  treated  in  the  Vlasov  equation  [98], 
[99].  As  several  authors  have  pointed  out,  this  approach  is  similar  to  the  traditional  CFD 
methods  for  interface  capturing  and  is  the  LBM  analogy  to  the  level-set  [100]  and  volume 
of  fluid  methods  [101].  This  model  has  been  successfully  used  to  model  Rayleigh-Taylor 
and  Kelvin-Helmholtz  instabilities  [102],  [103]  with  non- ideal  dense  fluids  among  other 
applications. 

4.  Inter-Particle  Potential  Model 

The  inter-particle  potential  model  was  proposed  in  [89]  as  a  simple  means  to  simu¬ 
late  multi-phase  and  multi-component  fluids.  The  fundamental  idea  is  that  the  surface  ten¬ 
sion  effects  which  conventional  CFD  methods  attempt  to  account  for  in  multi-component 
flows  is  microscopic  in  origin;  the  same  effect  could  be  incorporated  into  the  LBM  via 
these  same  inter-particle  potential  forces.  In  this  model,  only  the  nearest  neighbor  particle 
densities  are  considered  and  they  are  introduced  as  follows  in  Equation  61: 

q-l 

F  (x,  t)  =  —Gip  (x,  t)  Walp  (x  -f  BaAt,  t)  Oq.  (61) 

q:=1 

where  x  is  particle  position,  G  is  a  parameter  indicating  interaction  strength,  ^(x,  t)  is  a 
function  describing  the  interaction  potential,  Wa  is  the  lattice  weight  for  direction  a  and 
Oq,  is  the  corresponding  lattice  velocity.  The  form  of  the  potential  function  can  be  varied 
to  obtain  the  desired  inter-particle  potential  behavior.  For  multiphase  flow,  tp  is  commonly 
expressed  as  in  Equation  62: 


tp  (p)  =  tpo  exp  .  (62) 

where  tpo  and  po  are  arbitrary  parameters  selected  so  as  to  achieve  appropriate  dynamics 
for  a  selected  fluid  system.  The  multicomponent  fluids  systems  used  in  this  work  are  only 
qualitatively  correlated  with  real  fluid  systems  in  that  only  density  and  viscosity  are  set  and 
scaled  consistently  with  the  EBM.  The  parameter  G  is  set  so  as  to  generate  a  desired  level 
of  immiscibility  while  maintaining  simulation  stability.  The  potential  function  is  set  as 
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O  Component  A 


^  Component  B 


Figure  70:  Illustration  of  inter-particle  forces  in  the  nearest  neighborhood  of  a  lattice  point. 

Ip  (x,  t)  =  p  (x,  t).  Notice  that  Equation  61  specifies  a  weighted  summation  of  the  value  of 
Ip  for  each  lattice  position  in  the  neighborhood  of  a  given  lattice  particle.  This  is  illustrated 
schematically  in  Figure  70. 

B.  IMMISCIBLE  MULTI-COMPONENT  LBM  PROCEDURES 

The  basic  LBM  process  with  multiple  components  is  similar  in  most  ways  to  that 
used  for  single-component  systems  as  illustrated  in  Figure  10.  The  obvious  difference  is 
that  there  are  now  two  sets  of  distribution  functions;  as  before  /„  and  for  a  second  com¬ 
ponent  that  will  conventionally  be  referred  to  as  pa-  Along  with  the  physical  domain,  both 
fluids  are  scaled  according  to  the  scaling  rules  discussed  in  Chapter  II.  A  second  difference 
is  that,  as  discussed  in  the  preceding  paragraphs,  the  inter-particle  force  must  be  calculated 
in  accordance  with  Equation  61  and  incorporated  into  the  calculation  of  macroscopic  ve¬ 
locity  used  for  computing  and  5'®'^  using  Equation  35  and  Equation  36.  The  biggest  and 
most  fundamental  difference  is  the  need  to  structure  the  computations  so  as  to  account  for 
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the  fact  that,  due  to  the  desired  macroscopic  dynamic  evolution  of  the  system,  some  lattice 
sites  will  have  zero  density  for  one  or  the  other  component;  only  the  interfaces  will  have  a 
significant  mixture. 

1.  Time  Stepping 

In  order  to  be  as  explicit  as  possible,  the  immiscible  multicomponent  LBM  time- 
step  for  fluid  lattice  points  is  carried  out  for  this  work  was  implemented  as  follows: 

1.  Compute  macroscopic  density  of  each  fluid: 

q-l 

a=0 

q-l 

Pg  =  9a 

a=0 

2.  Compute  macroscopic  momentum  of  each  fluid: 

q-l 

Pf^f  ~  ^afa 

a=0 

q-l 

Pg^g  ^  ^  Gofi'o 

a=0 

3.  Compute  a  weighted  combined  macroscopic  density  and  velocity: 

pu  =  pfUf  +  Pgug 

^  ^  P/U/CU/  +  PgUgUg 

pu 

where  we  recall  u  =  -. 

T 
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4.  Compute  inter-particle  force  using  Equation  61  setting  G  to  a  constant, 

and  'll)  (x,  t)  =  p: 

q-l 

Ff  =  -Gpg  ^  WaPg  (x  e„At,  t) 

a.=l 


q-l 


Fg  =  -Gpf  ^  Wo^pf  (x  e^At,  t) 


Q:=l 


5.  Apply  these  inter-particle  forces  as  momentum  inputs  to  each  respective 

lattice  population: 

^7  =  ^f-  ^/F/ 


=  11  —  T  F 

^9  ‘9^9 


6.  Complete  the  usual  LBGK  collision  and  streaming  steps  using  and 

and  corresponding  macroscopic  density  for  computation  of  and  g'^‘^ 
accordingly. 

2.  Boundary  Conditions 

For  this  work,  three  types  of  boundary  conditions  have  so  far  been  used:  periodic 
boundaries,  bounce-back  boundaries  and  moving  boundaries.  The  bounce-back  and  pe¬ 
riodic  boundary  conditions  for  multi-component  LBM  is  the  same  as  for  ordinary  LBM. 
Solid  nodes  bounce-back  by  swapping  the  density  distribution  function  across  opposite 
directions  as  illustrated  in  Figure  5  in  Chapter  II.  It  should  be  noted  that  the  density  dis¬ 
tribution  values  resident  on  solid  nodes  do  not  contribute  to  the  nearest-neighbor  force 
calculation  given  in  Equation  61.  The  moving  boundary  condition  is  employed  in  the  same 
way  as  with  single-component  fluid  models  using  the  weighted  macroscopic  velocity  given 
in  step  3  of  the  above  procedure. 
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c. 


EXAMPLE  APPLICATIONS 


In  order  to  get  a  qualitative  feel  for  how  different  immiseible  multieomponent  sys¬ 
tems  behave  using  the  LBM,  some  simple  eases  were  implemented  for  experimentation. 

I.  Component  Separation 

This  is  probably  the  simplest  possible  multi-eomponent  LBM  example.  The  domain 
is  periodie  in  both  spatial  dimensions  so  there  are  only  periodie  boundary  eonditions.  The 
initial  eondition  is  a  random  distribution  between  the  phases  with  eaeh  lattiee  point  set  with 
a  slightly  higher  or  lower  proportion  of  eaeh  fluid.  The  parameter  G  is  set  to  -1.2  to  model 
strongly  repulsive  eomponents  and  the  simulation  is  set  to  run.  The  results  for  seleeted  time 
steps  is  presented  in  Figure  7 1 . 


Initial  Condition  T  =  500  T  =  5000 


0.999  1.001  0.999  1.001 


Figure  71:  Two  immiseible  eomponents.  G  =  -1.2 

This  behavior  ean  be  eompared  with  less  strongly  repulsive  eomponents  by  setting 
the  parameter  G  to  -0.2.  The  results  for  this  simulation  are  presented  in  Figure  72.  In  this 
ease,  almost  no  eomponent  separation  oeeurs  and  the  two  eomponents  as  modeled  are  fully 
miseible. 

2.  Lid-Driven  Cavity 

In  order  to  obtain  a  multieomponent  simulation  with  reeognizable  dynamies,  the 
lid-driven  eavity  problem  is  used  as  a  model.  For  this  problem  two  immiseible  fluids  are 
initially  eonfigured  with  fluid  1  (eorresponding  to  /  density  distributions)  at  the  top  half 
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Initial  Condition  Final  Condition 


0,999  1.001  0.999  1,001 


Figure  72:  Two  immiscible  components.  G  =  -0.2.  Note  that  the  weak  interaction  parameter 
renders  the  fluids  miscible. 

of  the  domain  and  fluid  2  (corresponding  to  g  density  distribution  functions)  at  the  bottom 
half  of  the  domain  as  shown  in  Figure  73.  Two  cases  will  be  considered  for  this  problem: 

1 .  Lid-driven  cavity  with  two  immiscible  fluids  of  equal  density  and  viscos¬ 
ity. 

2.  Lid-driven  cavity  with  two  immiscible  fluids  with  pi  =  y  and  z/i  =  ^. 
Additionally,  a  gravitational  body  force,  as  described  in  Chapter  II.E,  will 
be  added  to  accentuate  the  significance  of  the  differing  densities. 

a.  Case  1 

As  specified  above,  the  fluid  density  and  viscosity  are  set  to  be  equal;  the 
only  difference  is  the  repulsion  effect.  As  the  flow  develops,  the  combined  forces  of  the  lid, 
which  is  driving  the  flow,  and  the  inter-particle  repulsive  forces  take  effect  on  the  particle 
density  distributions.  For  these  strongly  repulsive  fluids  of  equal  density  and  viscosity,  the 
result  for  Fluid  1  is  as  we  would  intuitively  expect  and  is  shown  in  Figure  74. 
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Figure  73:  Density  for  Fluid  1  of  a  two-eomponent  immiseible  fluid  flow  in  a  lid-driven 
eavity.  Initial  eonfiguration. 


b.  Case  2 

Here,  the  fluid  properties  are  not  equal  and  additionally,  there  is  a  gravita¬ 
tional  body  foree  applied  to  both  fluids.  The  system  begins  in  the  same  initial  eonfiguration 
as  shown  for  Case  1 .  Time  evolution  of  the  density  for  the  two-eomponent  system  is  pre¬ 
sented  in  Figure  75.  The  momentum  evolution  for  Fluid  1  is  presented  in  Figure  76. 

3.  Lid-Driven  Cavity  with  FSI 

In  order  to  investigate  the  applieation  of  FSI  tools  to  multi-eomponent  flow,  a  simple 
problem  is  demonstrated.  The  problem  domain  is  depieted  sohematieally  in  Figure  77.  A 
vertieal  elastie  beam  is  ineluded  in  a  lid-driven  eavity.  The  length  of  the  beam  is  equal 
to  one-third  the  eavity  depth  and  it  is  modeled  as  an  Euler-Bernoulli  beam  with  a  elamped 
boundary  where  the  beam  interseets  with  the  bottom  of  the  eavity  and  free  on  the  other  end. 
Proportional  damping  was  applied  to  the  beam  to  prevent  exeessive  oseillations  that  would 
inhibit  a  stable  fluid  simulation.  The  initial  fluid  eondition  is  the  same  as  the  previous 
problem,  with  fluid  1  on  the  top  half  and  fluid  2  on  the  bottom  half.  The  lid  is  set  to  a 
eonstant  speed  so  as  to  generate  a  flow  field  eorresponding  to  Re=1000  based  on  the  eavity 
width. 
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Figure  74:  Fluid  1  results  for  a  two-eomponent  immiseible  fluid  flow  in  a  lid-driven  eavity. 
Sequenee  of  images  shows  flow  progression  from  top  to  bottom  eorresponding  respeetively 
to  early  in  the  simulation  to  its  final  steady  state. 
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0.00011  1.06958  0.00011  1.06958 


Figure  75:  Fluid  1  results  for  a  two-eomponent  immiseible  fluid  flow  in  a  lid  driven  eavity. 
The  higher  viseosity  and  density  of  fluid  2  results  in  its  eonfinement  in  the  bottom-half  of 
the  eavity  domain. 


l.Ole-7  0.10351  7.66e-6  0.09028 


Figure  76:  Fluid  1  results  for  a  two-eomponent  immiseible  fluid  flow  in  a  lid  driven  eavity. 
The  higher  viseosity  and  density  of  fluid  2  results  in  its  eonfinement  in  the  bottom-half  of 
the  eavity  domain. 
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Figure  77:  Schematic  representation  of  a  lid-driven  cavity  with  an  elastic  beam  attached  to 
the  lower  surface. 

The  fluid  parameters  used  for  this  simulation  are  identical  to  those  used  for  Case  1 
above.  The  fluids  both  have  the  same  density  and  viscosity  but  a  repellent  interpartical  po¬ 
tential  promotes  the  immiscibility  of  the  binary  fluid.  A  single-component  fluid  simulation 
was  carried  out  with  the  problem  geometry  and  boundary  conditions  in  order  to  develop 
some  intuition  for  what  flow  conditions  are  expected  within  the  cavity.  The  results  of  this 
computation  are  presented  in  Figure  78 

The  FSI  tools  and  procedures  were  used  as  in  the  previous  section  with  the  modifi¬ 
cation  that  the  fluid  forces  were  computed  as  a  sum  of  forces  due  to  each  fluid  component. 
Similarly,  the  velocity  boundary  condition  was  inserted  as  a  momentum  input  to  both  flu¬ 
ids  using  the  moving  surface  boundary  condition.  As  with  the  previous  simulations,  no 
accounting  was  made  for  material  or  geometric  nonlinearities.  Consequently,  the  simula¬ 
tion  was  arranged  so  that  displacements  would  be  small  in  comparison  to  the  underlying 
discretization  in  both  the  fluid  and  solid  domains. 

The  final  momentum  and  density  of  fluid  1  is  presented  in  Figure  79.  As  expected, 
the  fluid  initially  in  the  upper  half  of  the  cavity — fluid  1 — is  ultimately  concentrated  in  the 
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Figure  78:  Single-component  fluid  flow  in  cavity  with  beam.  Streamlines  show  develop¬ 
ment  of  three  distinct  vortex  regions. 

main  vortex  in  the  upper  right-hand  portion  of  the  cavity.  There  is  a  smaller  vortex  region 
in  the  lower  right  corner  where  fluid  1  and  fluid  2  circulate  and  mix.  The  large  bean-shaped 
circulating  region  on  the  left  portion  of  the  cavity  is  a  mixture  of  both  fluids  as  well,  with 
fluid  2  concentrating  in  the  lower  left  region  where  it  is  trapped  against  the  beam  and  the 
lower  domain  boundary. 

The  resulting  displacement,  velocity  and  acceleration  of  the  beam  tip  is  presented 
in  Figure  80.  A  depiction  of  the  final  displacement  of  the  entire  beam  is  presented  in  Figure 
81. 
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Figure  79:  Momentum  and  density  fields  for  fluid  1  at  steady-state;  Re=1000. 


Figure  80:  Plot  of  displacement,  velocity  and  acceleration  at  the  tip  of  the  elastic  beam. 
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Figure  81:  Final  beam  displacement  for  multi-component  FSI.  Beam  displacement  is  mag¬ 
nified  10  times  for  clarity. 
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VIII.  CONCLUSIONS  AND  FUTURE  WORK 


A.  CONCLUSIONS 

The  primary  objective  of  this  thesis  is  to  investigate  the  use  of  LBM  to  model  vis¬ 
cous  incompressible  flow  and  model  the  interactions  that  flow  has  with  surrounding  struc¬ 
tures.  In  accomplishing  this  goal,  several  outcomes  have  been  achieved: 

1.  A  body  of  software  has  been  developed  to  allow  LBM  modeling  of  single¬ 
component  incompressible  flow  in  two  and  three  dimensions. 

(a)  This  software  was  validated  against  a  selection  of  two-  and  three- 
dimensional  benchmark  problems  and  shown  to  accurately  model 
low  and  moderate  Reynolds  number  fluid  flows 

(b)  It  was  shown  that  the  expected  quadratic  convergence  properties 
of  LBM  can  be  lost  for  single-precision  computations  and  a  new 
mixed-precision  implementation  was  demonstrated  to  extend  the 
range  of  quadratic  convergence  at  less  computational  expense  than 
carrying  out  computations  in  full  double  precision. 

(c)  The  software  performance  was  compared  to  other  single-GPU 
implementations  reported  in  the  recent  literature  and  it  was  shown 
that  the  implementation  presented  for  this  thesis  outperforms  all 
others  over  certain  problem  sizes  and  is  highly  competitive  for  all 
problem  sizes. 

2.  The  LBM  software  tools  were  extended  to  incorporate  structural  interac¬ 
tions  thus  allowing  FSI  simulations. 

(a)  FSI  simulations  were  demonstrated  for  both  two-  and  three-dimensional 
single-component  fluid  flow  problems. 
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(b)  A  novel  heterogeneous  parallel  implementation  was  described 
where  task-level  parallelism  inherent  in  the  FSI  problem  is  parceled 
among  both  the  CPU  and  GPU  on  a  single  workstation.  It  was 
shown  that  by  decomposing  the  problem  domain  that  performance 
can  be  improved  by  over  23  percent  for  some  problems. 

(c)  A  simple  two-dimensional  FSI  simulation  was  presented  for  multi- 
component  fluid  flows. 

3.  A  novel  method  for  combining  CLBM  and  FELBM  into  a  HBLM  was  de¬ 

veloped.  This  allows  the  efficient  description  of  curved  domain  boundaries 
while  minimizing  the  computational  expense  of  FELBM. 

B.  FUTURE  WORK 

Much  work  remains  to  be  done  for  future  research  in  LBM  for  fluid  flows.  Eor 
single-component  fluid  flows,  the  software  tools  presented  in  this  work  only  accommodate 
flows  for  Reynolds  number  up  to  approximately  10,000.  Current  research  in  entropic  and 
thermal  LBM  along  with  single  and  multi-parameter  turbulence  models  can  be  incorporated 
to  extend  the  range  of  simulation  up  through  at  least  the  range  of  incompressible  fluid 
flow.  Eurther  research  needs  to  be  done  to  include  some  capability  for  compressible  flow 
modeling.  Accomplishment  of  these  goals  will  greatly  expand  the  range  of  applicability  of 
the  LBM  models  presented  here  and  open  the  door  to  a  multitude  of  interesting  research 
applications. 

The  multi-component  fluid  flow  modeling  capability  is  also  in  need  of  much  work. 
Current  implementations  are  restrictive  in  the  range  of  fluid  density  and  viscosity  ratios 
that  can  be  stably  simulated.  Additionally,  more  physical  fidelity  is  required  in  the  inter¬ 
particle  potential  modeling  to  allow  predictive  rather  than  merely  qualitative  simulations  to 
be  conducted. 

Eor  successful  application  of  improved  LBM  models  to  more  realistic  problem 
types,  the  need  exists  to  further  extend  the  performance  and  scalability  of  the  software 
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tools.  As  implemented  for  this  work,  exeellent  performanee  was  obtained  for  software  run 
on  a  single  GPU  or  a  small  number  of  GPUs  attaehed  to  a  single  workstation.  The  seal¬ 
ing  aehieved  for  multiple  GPUs  over  multiple  platforms  was  less  well  developed  for  this 
work.  To  aeeommodate  future  sealing,  improved  partitioning  strategies  must  be  ineluded 
and  ineorporated  into  the  software  platform  in  a  way  that  allows  effieient  sealing  to  an  ar¬ 
bitrarily  large  set  of  eomputational  resourees.  Additionally,  software  eomponents  sueh  as 
solid  modelers  and  mesh  generators  must  be  adopted  and  interfaeed  with  existing  tools  to 
assist  with  LBM  problem  formulation  and  setup. 

In  order  to  employ  lattiee  Boltzmann  fluid  models  for  future  FSI  researeh  on  more 
physieally  relevant  problems,  extensive  work  is  also  required  for  both  the  struetural  models 
as  well  as  the  LBM  eomputational  routines.  Of  highest  priority  for  the  struetural  model  is 
the  need  to  ineorporate  geometrie  and  material  non-linearity  into  the  dynamie  models.  FSI 
problems  of  praetieal  interest  involve  deformations  that  are  not  small  and  strains  that  are 
large.  Consequently,  linear  elastic  structural  dynamics  models  are  overly  restrictive  in  the 
range  of  problems  for  which  they  will  provide  a  satisfactory  answer.  Future  research  efforts 
will  be  directed  towards  meeting  this  need  with  a  full  range  of  non-linear  material  constitu¬ 
tive  models  and  FEM  technology  to  accommodate  moving  and  deforming  meshes.  On  the 
fluid  modeling  front,  it  is  imperative  that  extensions  be  developed  to  admit  moving  solid 
boundaries  that  displace  over  an  arbitrary  number  of  lattice  points.  Methods  must  be  imple¬ 
mented  to  properly  initialize  and  update  lattice  points  that  are  either  entering  or  emerging 
a  non-fluid  domain.  Alternately,  non-classical  LBM  formulations  such  as  FELBM  and 
HLBM  must  be  further  developed  so  that  both  the  fluid  and  solid  domain  can  be  consis¬ 
tently  described  in  a  Lagrangian  manner  within  a  given  ESI  simulation  over  a  more  broad 
range  of  problems. 
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